Abstract

We consider finite-state concurrent stochastic games, played by k ≥ 2 players for an infinite number of rounds,
where in every round, each player simultaneously and independently of the other players chooses an action, whereafter 
the successor state is determined by a probability distribution given by the current state and the chosen actions. We 
consider reachability objectives that given a target set of states require that some state in the target set is visited, and 
the dual safety objectives that given a target set require that only states in the target set are visited. We are interested 
in the complexity of stationary strategies measured by their patience, which is defined as the inverse of the smallest 
non-zero probability employed. 

Our main results are as follows: We show that in two-player zero-sum concurrent stochastic games (with reachability
objective for one player and the complementary safety objective for the other player): (i) the optimal bound
on the patience of optimal and ε-optimal strategies, for both players, is doubly exponential; and (ii) even in games
with a single non-absorbing state, exponential (in the number of actions) patience is necessary. In general we study
the class of non-zero-sum games admitting ε-Nash equilibria. We show that if there is at least one player with
reachability objective, then doubly-exponential patience is needed in general for ε-Nash equilibrium strategies, whereas in
contrast if all players have safety objectives, then the optimal bound on patience for ε-Nash equilibrium strategies is
only exponential.


1 Introduction 

Concurrent stochastic games. Concurrent stochastic games are played on finite-state graphs by k players for an 
infinite number of rounds. In every round, each player simultaneously and independently of the other players chooses 
moves (or actions). The current state and the chosen moves of the players determine a probability distribution over the 
successor state. The result of playing the game (or a play) is an infinite sequence of states and action vectors. These 
games with two players were introduced in a seminal work by Shapley [34], and have been one of the most fundamental
and well-studied game models in stochastic graph games. Matrix games (or normal form games) can model
a wide range of problems with diverse applications, when there is a finite number of interactions. Concurrent
stochastic games can be viewed as a finite set of matrix games, such that the choices made in the current game determine
which game is played next, and are the appropriate model for many applications ini. Moreover, in analysis of
reactive systems, concurrent games provide the appropriate model for reactive systems with components that interact
synchronously ifT^ [131 121.

Objectives. An objective for a player defines the set of desired plays for the player, i.e., if a play belongs to the 
objective of the player, then the player wins and gets payoff 1, otherwise the player loses and gets payoff 0. The
most basic objectives for concurrent games are the reachability and the safety objectives. Given a set F of states, a 
reachability objective with target set F requires that some state in F is visited at least once, whereas the dual safety 
objective with target set F requires that only states in F are visited. In this paper, we will only consider reachability and 
safety objectives. A zero-sum game consists of two players (player 1 and player 2), and the objectives of the players 
are complementary, i.e., a reachability objective with target set F for one player and a safety objective with target set 
complement of F for the other player. In this work, when we refer to zero-sum games we will imply that one player 
has reachability objective, and the other player has the complementary safety objective. Concurrent zero-sum games
are relevant in many applications. For example, the synthesis problem in control theory (e.g., discrete-event systems as 
considered in corresponds to reactive synthesis of BTl . The synthesis problem for synchronous reactive systems 
is appropriately modeled as concurrent games lEiiniiii. Other than control theory, concurrent zero-sum games also 
provide the appropriate model to study several other interesting problems, such as two-player poker games [28].

Properties of strategies in zero-sum games. Given a zero-sum concurrent stochastic game, the player-1 value $v_1(s)$
of the game at a state $s$ is the limit probability with which he can guarantee his objective against all strategies of
player 2. The player-2 value $v_2(s)$ is analogously the limit probability with which player 2 can ensure his own
objective against all strategies of player 1. Concurrent zero-sum games are determined IfT^ , i.e., for each state $s$ we
have $v_1(s) + v_2(s) = 1$. A strategy for a player, given a history (i.e., finite prefix of a play), specifies a probability
distribution over the actions. A stationary strategy does not depend on the history, but only on the current state. For
ε ≥ 0, a strategy is ε-optimal for a state $s$ for player $i$ if it ensures his own objective with probability at least
$v_i(s) - \varepsilon$ against all strategies of the opponent. A 0-optimal strategy is an optimal strategy. In zero-sum
concurrent stochastic games, there exist stationary optimal strategies for the player with safety objectives ll^l23l ;
whereas in contrast, for the player with reachability objectives optimal strategies do not exist in general; however,
for every ε > 0 there exist stationary ε-optimal strategies M-

The significance of patience and roundedness of strategies. The basic decision problem is as follows: given a
zero-sum concurrent stochastic game and a rational threshold λ, decide whether $v_1(s) > \lambda$. The basic decision
problem is in PSPACE and is square-root sum hard [15]¹. Given the hardness of the basic decision problem, the next
most relevant computational problem is to compute an approximation of the value. The computational complexity of
the approximation problem is closely related to the size of the description of ε-optimal strategies. Even for special
cases of zero-sum concurrent stochastic games, namely turn-based stochastic games, where in each state at most one
player can choose between multiple moves, the best known complexity results are obtained by guessing an optimal
strategy and computing the value in the game obtained after fixing the guessed strategy. A strategy has patience p if
p is the inverse of the smallest non-zero probability used by a distribution describing the strategy. A rational-valued
strategy has roundedness q if q is the greatest denominator of the probabilities used by the distributions describing the
strategy. Note that if a strategy has roundedness q, then it also has patience at most q. The description complexity
of a stationary strategy can be bounded by the roundedness. A stationary strategy with exponential roundedness
can be described using polynomially many bits, whereas the explicit description of stationary strategies with
doubly-exponential patience is not polynomial. Thus obtaining upper bounds on the roundedness and lower bounds on the
patience is at the heart of the computational complexity analysis of concurrent stochastic games.
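
The gap in description size can be made concrete with a small calculation. The sketch below counts the bits needed to write down one denominator; the function names and the choice of 2^n and 2^(2^n) as representative exponential and doubly-exponential denominators are illustrative assumptions, not taken from the paper.

```python
def denominator_bits(q: int) -> int:
    """Bits needed to write the positive integer q in binary."""
    return q.bit_length()


def description_bits(n: int) -> tuple[int, int]:
    """Bits for one denominator of size 2**n versus one of size 2**(2**n).

    Both denominators are hypothetical stand-ins for "exponential" and
    "doubly-exponential" roundedness in a game with n states.
    """
    exponential = 2 ** n
    doubly_exponential = 2 ** (2 ** n)
    return denominator_bits(exponential), denominator_bits(doubly_exponential)


if __name__ == "__main__":
    for n in (2, 4, 8):
        # The first number grows linearly in n, the second exponentially in n.
        print(n, description_bits(n))
```

With an exponential denominator the bit count grows linearly in n, so a stationary strategy fits in polynomially many bits; with a doubly-exponential denominator a single probability already needs exponentially many bits.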

Strategies in non-zero-sum games and roundedness. In non-zero-sum games, the most well-studied notion of
equilibrium is Nash equilibrium [26], which is a strategy vector (one for each player), such that no player has an
incentive of unilateral deviation (i.e., if the strategies of all other players are fixed, then a player cannot switch strategy
and improve his own payoff). The existence of Nash equilibrium in non-zero-sum concurrent stochastic games where
all players have safety objectives has been established in 13^ . It follows from the strategy characterization of the
result of ll^ and our Lemma 1^ that if such strategies have exponential roundedness and form an ε-Nash equilibrium,
for a constant or even logarithmic number of players, for ε > 0, then there will be a polynomial-size witness for those
strategies (and the approximation of a Nash equilibrium can be achieved in TFNP, see Remark 44). Thus again the
notion of roundedness is at the core of the computational complexity of non-zero-sum games.

Previous results and our contributions. In this work we consider concurrent stochastic games (both zero-sum and 
non-zero-sum) where the objectives of the players are either reachability or safety. We first describe the relevant 
previous results and then our contributions. 

Previous results. For zero-sum concurrent stochastic games, the optimal bound on patience and roundedness for
ε-optimal strategies for reachability objectives, for ε > 0, is doubly exponential [22, 20]. The doubly-exponential
lower bound is obtained by presenting a family of games (namely, Purgatory) where the reachability player requires
doubly-exponential patience (however, in this game the patience of the safety player is 1) [22, 20]; whereas the
doubly-exponential upper bound is obtained by expressing the values in the existential theory of reals [22, 20]. In contrast to
reachability objectives that in general do not admit optimal strategies, similar to safety objectives there are two related
classes of concurrent stochastic games that admit optimal stationary strategies, namely, discounted-sum objectives,
and ergodic concurrent games. For both these classes the optimal bound on patience and roundedness for ε-optimal
strategies, for ε > 0, is exponential lfmi24l . The optimal bound on patience and roundedness for optimal and ε-optimal
strategies, for ε > 0, for safety objectives has been an open problem.

¹The square-root sum problem is an important problem from computational geometry, where given a set of natural numbers $n_1, n_2, \ldots$,
the question is whether the sum of the square roots exceeds an integer $b$. The problem is not known to be in NP.

Our contributions. Our main results are as follows: 

1. Lower bound: general. We show that in zero-sum concurrent stochastic games, a lower bound on patience
of optimal and ε-optimal strategies, for ε > 0, for safety objectives is doubly exponential (in contrast to the
above-mentioned related classes of games that admit stationary optimal strategies and require only exponential
patience). We present a family of games (namely, Purgatory Duel) where the optimal and ε-optimal strategies,
for ε > 0, for both players require doubly-exponential patience.

2. Lower bound: three states. We show that even in zero-sum concurrent stochastic games with three states of
which two are absorbing (sink states with only self-loop transitions) the patience required for optimal and
ε-optimal strategies, for ε > 0, is exponential (in the number of actions). An optimal (resp., ε-optimal, for ε > 0)
strategy in a game with three states (with two absorbing states) is basically an optimal (resp., ε-optimal) strategy
of a matrix game, where some entries of the matrix game depend on the value of the non-absorbing state (as
some transitions of the non-absorbing state can lead to itself). In standard matrix games, the patience for
ε-optimal strategies, for ε > 0, is only logarithmic lIZTl ; and perhaps surprisingly in contrast we show that the
patience for ε-optimal strategies in zero-sum concurrent stochastic games with three states is exponential (i.e.,
there is a doubly-exponential increase from logarithmic to exponential).

3. Upper bound. We show that in zero-sum concurrent stochastic games, an upper bound on the patience of optimal
strategies and an upper bound on the patience and roundedness of ε-optimal strategies, for ε > 0, is as follows:
(a) doubly exponential in general; and (b) exponential for the safety player if the number of value classes (i.e.,
the number of different values in the game) is constant. Hence our upper bounds on roundedness match our
lower bound results for patience. Our results also imply that if the number of value classes is constant, then the
basic decision problem is in coNP (resp., NP) if player 1 has reachability (resp., safety) objective.

4. Non-zero-sum games. We consider non-zero-sum concurrent stochastic games with reachability and safety
objectives. First, we show that it easily follows from our example family of Purgatory Duel that if there are at
least two players and there is at least one player with reachability objective, then a lower bound on patience for
ε-Nash equilibrium is doubly exponential, for ε > 0, for all players. In contrast, we show that if all players
have safety objectives, then the optimal bound on patience of strategies for ε-Nash equilibrium is exponential,
for ε > 0 (i.e., for the upper bound we show that there always exists an ε-Nash equilibrium where the strategy of
each player requires at most exponential roundedness; and there exists a family of games, where for any ε-Nash
equilibrium the strategies of all players require at least exponential patience).

In summary, we present a complete picture of the patience and roundedness required in zero-sum concurrent stochastic 
games, and non-zero-sum concurrent stochastic games with safety objectives for all players. Also see Section lTSl for 
a discussion on important technical aspects of our results. 

Distinguishing aspects of safety and reachability. While the optimal bound on patience and roundedness we establish
in zero-sum concurrent stochastic games for the safety player matches that for the reachability player, there are
many distinguishing aspects for safety as compared to reachability in terms of the number of value classes (as shown
in Table 1). For the reachability player, if there is one value class, then the patience and roundedness required is linear:
it follows from the results of 171 that if there is one value class then all the values must be either 1 or 0; and if all states
have value 0, then any strategy is optimal, and if all states have value 1, then it follows from MM that there is an
almost-sure winning strategy (that ensures the objective with probability 1) from all states and the optimal bound on
patience and roundedness is linear. The family of game graphs defined by Purgatory has two value classes, and the
reachability player requires doubly-exponential patience and roundedness, even for two value classes. In contrast, if
there are (at most) two value classes, then again the values are 1 and 0; and in value class 1, the safety player has an
optimal strategy that is stationary and deterministic (i.e., a positional strategy) and has patience and roundedness 1 m,
and in value class 0 any strategy is optimal. While for two value classes, the patience and roundedness is 1 for the
safety player, we show that for three value classes (even for three states) the patience and roundedness is exponential,
and in general the patience and roundedness is doubly exponential (and such a finer characterization does not exist for
reachability objectives). Finally, for non-zero-sum games (as we establish), if there are at least two players, then even
in the presence of one reachability player, the patience required is at least doubly exponential, whereas if all players
have safety objectives, the patience required is only exponential.


# Value classes    Reachability          Safety
-----------------------------------------------------------------------------------------------
1                  Linear                One
2                  Double-exponential    One
3                  Double-exponential    Exponential (LB, Theorem 29)
Constant           Double-exponential    Exponential (UB, Corollary 34)
General            Double-exponential    Double-exponential (LB, Theorem 20; UB, Corollary 34)

Table 1: Strategy complexity (i.e., patience and roundedness of ε-optimal strategies, for ε > 0) of reachability vs
safety objectives depending on the number of value classes. Our results are bold faced, and LB (resp., UB) denotes
lower (resp., upper) bound on patience (resp., roundedness).

Our main ideas. Our most interesting results are the doubly-exponential and exponential lower bound on the patience 
and roundedness in zero-sum games. We now present a brief overview about the lower bound example. 

The game of Purgatory [20] is a concurrent reachability game M that was defined as an example showing
that the reachability player must, in order to play near optimally, use a strategy with non-zero probabilities that are 
doubly exponentially small in the number of states of the game (i.e., the patience is doubly exponential). 

In this paper we present another example of a reachability game where this is the case for the safety player as well. 
The game Purgatory consists of a (potentially infinite) sequence of escape attempts. In an escape attempt one player is 
given the role of the escapee and the other player is given the role of the guard. An escape attempt consists of at most
N rounds. In each round, the guard selects and hides a number between 1 and m, and the escapee must try to guess 
the number. If the escapee successfully guesses the number N times, the game ends with the escapee as the winner. 
If the escapee incorrectly guesses a number which is strictly larger than the hidden number, the game ends with the 
guard as the winner. Otherwise, if the escapee incorrectly guesses a number which is strictly smaller than the hidden 
number, the escape attempt is over and the game continues. 

The game of Purgatory is such that the reachability player is always given the role of the escapee, and the safety 
player is always given the role of the guard. If neither player wins during an escape attempt (meaning there is an 
infinite number of escape attempts) the safety player wins. Purgatory may be modelled as a concurrent reachability 
game consisting of N non-absorbing positions in which each player has m actions. The value of each non-absorbing 
position is 1. This means that the reachability player has, for any ε > 0, a stationary strategy that wins from each
non-absorbing position with probability at least $1 - \varepsilon$ El , but such strategies must have doubly-exponential patience.
In fact for N sufficiently large and m > 2, such strategies must have patience at least for ε = 1 −
[20]. For the safety player however, the situation is simple: any strategy is optimal.
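
The rules of a single round of Purgatory, as described above, can be sketched as a transition function. This is a minimal illustration under stated assumptions: the position index counts the current round of the escape attempt (1 to N), a failed attempt restarts at position 1, and the names `purgatory_step`, `ESCAPEE_WINS` and `GUARD_WINS` are hypothetical.

```python
ESCAPEE_WINS = "escapee_wins"  # absorbing: the reachability player wins
GUARD_WINS = "guard_wins"      # absorbing: the safety player wins


def purgatory_step(position: int, guess: int, hidden: int, N: int):
    """Successor of non-absorbing `position` (1..N) for one round.

    `guess` is the escapee's number and `hidden` is the guard's number,
    both between 1 and m.
    """
    if guess == hidden:
        # Correct guess: advance, or win after the N-th correct guess.
        return ESCAPEE_WINS if position == N else position + 1
    if guess > hidden:
        # Overshooting the hidden number ends the game in favour of the guard.
        return GUARD_WINS
    # Undershooting ends the escape attempt; a new attempt starts.
    return 1


if __name__ == "__main__":
    print(purgatory_step(1, 2, 2, N=3))  # correct guess, move on
    print(purgatory_step(3, 1, 1, N=3))  # N-th correct guess
    print(purgatory_step(2, 3, 1, N=3))  # guess too large
    print(purgatory_step(2, 1, 3, N=3))  # guess too small
```

The asymmetry between the two kinds of wrong guess is what makes cautious play costly: guessing low is safe but only restarts the attempt, while guessing high risks losing outright.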

We introduce a game we call the Purgatory Duel in which the safety player must also use strategies of doubly- 
exponential patience to play near optimally. The main idea of the game is that it forces the safety player to behave as 
a reachability player. We can describe the new game as a variation on the above description of the Purgatory game. 
The Purgatory Duel consists also of a (potentially infinite) sequence of escape attempts. But now, before each escape 
attempt the role of the escapee is given to each player with probability 1/2, and in each escape attempt the rules are as
described above. The game remains asymmetric in the sense that if neither player wins during an escape attempt, the 
safety player wins. 

The Purgatory Duel may be modelled as a concurrent reachability game consisting of 2N + 1 non-absorbing 
positions, in which each player has m actions, except for a single position where the players each have just a single 
action. 
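
As a rough sketch of this model, the 2N + 1 non-absorbing positions can be taken to be one start position plus N positions for each assignment of the escapee role. The encoding below is an assumption made for illustration (state names, the outcome labels, and returning to the start position after a failed attempt are not spelled out in the text); player 1 is the reachability player and player 2 the safety player.

```python
from fractions import Fraction

START = "start"
HALF = Fraction(1, 2)


def duel_states(N: int):
    """Non-absorbing positions: the start plus N positions per escapee role."""
    return [START] + [(escapee, i) for escapee in (1, 2) for i in range(1, N + 1)]


def duel_step(state, action1: int, action2: int, N: int):
    """Distribution over successors as a dict {successor: probability}."""
    if state == START:
        # Single action for each player; the escapee role is drawn uniformly.
        return {(1, 1): HALF, (2, 1): HALF}
    escapee, i = state
    guard = 2 if escapee == 1 else 1
    guess, hidden = (action1, action2) if escapee == 1 else (action2, action1)
    if guess == hidden:
        if i == N:
            return {f"player{escapee}_wins": Fraction(1)}
        return {(escapee, i + 1): Fraction(1)}
    if guess > hidden:
        return {f"player{guard}_wins": Fraction(1)}
    # Failed attempt: back to the start, where roles are drawn again.
    return {START: Fraction(1)}


if __name__ == "__main__":
    print(len(duel_states(3)))            # number of non-absorbing positions
    print(duel_step(START, 1, 1, N=3))
    print(duel_step((2, 3), 2, 2, N=3))
```

In this encoding an infinite play that never reaches an absorbing outcome never reaches the target of the reachability player, which matches the rule that the safety player wins if neither player wins during an escape attempt.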

Technical contribution. The key non-trivial aspects of our proof are as follows: first, is to come up with the family of
games, namely, Purgatory Duel, where the ε-optimal strategies, for ε > 0, for the players are symmetric, even though
the objectives are complementary; and then the precise analysis of the game needs to combine and extend several
ideas, such as refined analysis of matrix games, and analysis of perturbed Markov decision processes (MDPs) which
are one-player stochastic games.

Highlights. We highlight two features of our results, namely, the surprising aspects and the significance (see
Section 7.1 for further details).

1. Surprising aspects. The first surprising aspect of our result is the doubly-exponential lower bound for concurrent
safety games. The properties of strategies in concurrent safety games resemble concurrent discounted games,
as in both cases optimal stationary strategies exist, and locally optimal strategies are optimal. We show that
in contrast to concurrent discounted games, where exponential patience suffices, for concurrent safety games
doubly-exponential patience is necessary. The second surprising aspect is the lower bound example itself. The
lower bound example is obtained as follows: (i) given Purgatory we first obtain simplified Purgatory by changing
the start state such that it deterministically goes to the next state; (ii) we then consider its dual where the roles of
the players are exchanged; and (iii) Purgatory Duel is obtained by merging the start states of simplified Purgatory
and its dual. Both in simplified Purgatory and its dual, there are only two value classes, and positional optimal
strategies exist for the safety player. Surprisingly we show that a simple merge operation gives a game with
a linear number of value classes and the patience increases from 1 to doubly exponential. Finally, the properties
of strategies in concurrent reachability and safety games differ substantially. An important aspect of our lower
bound example is that we show how to modify an example for reachability games to obtain the result for safety
games.

2. Significance. Our most important results are the lower bounds, and the main significance is threefold. First,
the most well-studied way to obtain computational complexity results in games is to explicitly guess strategies,
and then verify the game obtained by fixing the strategy. The lower bound for concurrent reachability games by
itself did not rule out that better complexity results can be obtained through better strategy complexity for safety
games (indeed, for a constant number of value classes, we obtain a better complexity result than known before
due to the exponential bound on roundedness). Our doubly-exponential lower bound shows that in general the
method of explicitly guessing strategies would require exponential space, and would not yield NP or coNP
upper bounds. Second, one of the most well-studied algorithms for games is the strategy-iteration algorithm.
Our result implies that any natural variant of the strategy-iteration algorithm for the safety player that explicitly
computes strategies requires exponential space in the worst case. Finally, in games, strategies that are witnesses to
the values and specify how to play the game are as important as values, and our results establish the precise
strategy complexity (matching upper bound of roundedness with lower bounds of patience).

Related work. We have already discussed the relevant related works such as ll30l|23l[l6l[l5l|22ll20l[Tl on zero-sum 
games. We discuss relevant related works for non-zero-sum games. The computational complexity of constrained 
Nash equilibrium, which asks for the existence of a Nash (or ε-Nash, for ε > 0) equilibrium that guarantees at least a
given payoff vector, has been studied. The constrained Nash equilibrium problem is undecidable even for turn-based stochastic
games, or concurrent deterministic games with randomized strategies Esna. The complexity of constrained Nash
equilibrium in concurrent deterministic games with pure strategies has been studied in 0121. In contrast, we study
the complexity of computing some Nash equilibrium in randomized strategies in concurrent stochastic games, and our 
result on roundedness implies that with safety objectives for all players the approximation of some Nash equilibrium 
can be achieved in TFNP. 

2 Definitions 

Other number. Given a number $i \in \{1, 2\}$ let $\widehat{i}$ be the other number, i.e., if $i = 1$, then $\widehat{i} = 2$ and if $i = 2$, then $\widehat{i} = 1$.

Probability distributions. A probability distribution $d$ over a finite set $Z$ is a map $d : Z \to [0, 1]$ such that
$\sum_{z \in Z} d(z) = 1$. Fix a probability distribution $d$ over a set $Z$. The distribution $d$ is pure (Dirac) if $d(z) = 1$ for
some $z \in Z$, and for convenience we overload the notation and let $d = z$. The support $\mathrm{Supp}(d)$ is the subset $Z'$ of $Z$
such that $z \in Z'$ if and only if $d(z) > 0$. The distribution $d$ is totally mixed if $\mathrm{Supp}(d) = Z$. The patience of $d$ is
$\max_{z \in \mathrm{Supp}(d)} \frac{1}{d(z)}$, i.e., the inverse of the minimum non-zero probability. The roundedness of $d$, if $d(z)$ is a rational
number for all $z \in Z$, is the greatest denominator of $d(z)$. Note that the roundedness of $d$ is always at least the patience of
$d$. Given two elements $z, z' \in Z$, the probability distribution $d = U(z, z')$ over $Z$ is such that $d(z) = d(z') = \frac{1}{2}$. Let
$\Delta(Z)$ be the set of all probability distributions over $Z$.
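
These definitions translate directly into code. The sketch below represents a distribution as a dictionary of exact rationals; the representation and the function names are illustrative assumptions, not notation from the paper.

```python
from fractions import Fraction


def support(d):
    """Elements that receive non-zero probability."""
    return {z for z, p in d.items() if p > 0}


def is_totally_mixed(d):
    """True if every element of the underlying set is in the support."""
    return support(d) == set(d)


def patience(d):
    """Inverse of the smallest non-zero probability."""
    return max(1 / p for p in d.values() if p > 0)


def roundedness(d):
    """Greatest denominator among the (reduced) rational probabilities."""
    return max(p.denominator for p in d.values())


if __name__ == "__main__":
    d = {"a": Fraction(2, 5), "b": Fraction(3, 5), "c": Fraction(0)}
    assert sum(d.values()) == 1
    print(support(d), is_totally_mixed(d))
    # Roundedness is at least the patience: here 5 versus 5/2.
    print(patience(d), roundedness(d))
```

The example distribution shows that the two measures can differ: the smallest non-zero probability is 2/5, so the patience is 5/2, while the greatest denominator is 5.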
Concurrent game structure. A concurrent game structure for $k$ players consists of (1) a finite set of states $S$,
of size $N$; and (2) for each state $s \in S$ and each player $i$ a set $A^i_s$ of actions (and $A^i = \bigcup_s A^i_s$ is the set of all
actions for player $i$, for each $i$; and $A = \bigcup_i A^i$ is the set of all actions) such that $A^i_s$ consists of at most $m$ actions;
and (3) a stochastic transition function $\delta : S \times A^1 \times A^2 \times \cdots \times A^k \to \Delta(S)$. Also, a state $s$ is deterministic if
$\delta(s, a_1, a_2, \ldots, a_k)$ is pure (deterministic), for all $a_i \in A^i_s$ and for all $i$. A state $s$ is called absorbing if $A^i_s = \{a\}$ for
all $i$ and $\delta(s, a, a, \ldots, a) = s$. The number $\delta_{\min}$ is

\[ \delta_{\min} = \min_{s,\, a_1, \ldots, a_k,\; s' \in \mathrm{Supp}(\delta(s, a_1, a_2, \ldots, a_k))} \delta(s, a_1, a_2, \ldots, a_k)(s') , \]

i.e., the smallest non-zero transition probability.
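
A minimal sketch of such a structure, assuming the transition function is stored as a dictionary from (state, action profile) to a distribution over successors. The two-player example game and all names are hypothetical and only serve to illustrate deterministic states, absorbing states and the quantity written above as the smallest non-zero transition probability.

```python
from fractions import Fraction

# Hypothetical two-player structure with states "s" and "t".
delta = {
    ("s", ("a", "a")): {"t": Fraction(1)},
    ("s", ("a", "b")): {"s": Fraction(3, 4), "t": Fraction(1, 4)},
    ("s", ("b", "a")): {"s": Fraction(1)},
    ("s", ("b", "b")): {"s": Fraction(1, 3), "t": Fraction(2, 3)},
    ("t", ("a", "a")): {"t": Fraction(1)},  # single action per player
}


def delta_min(delta):
    """Smallest non-zero transition probability."""
    return min(p for dist in delta.values() for p in dist.values() if p > 0)


def is_deterministic(delta, s):
    """Every action profile at s leads to a pure distribution."""
    return all(
        max(dist.values()) == 1
        for (state, _), dist in delta.items()
        if state == s
    )


def is_absorbing(delta, s):
    """One action per player at s, and that profile loops back to s."""
    dists = [dist for (state, _), dist in delta.items() if state == s]
    return len(dists) == 1 and dists[0].get(s, 0) == 1


if __name__ == "__main__":
    print(delta_min(delta))
    print(is_deterministic(delta, "s"), is_absorbing(delta, "t"))
```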

Safety and reachability objectives. Each player $i$, who has a safety or reachability objective, is identified by a pair
$(t_i, S^i)$, where $t_i \in \{\text{Reach}, \text{Safety}\}$ and $S^i \subseteq S$.

Concurrent games and how to play them. Fix a number $k$ of players. A concurrent game consists of a concurrent
game structure for $k$ players and for each player $i$ a pair $(t_i, S^i)$, identifying the type of that player. The game $G$,
starting in state $s$, is played as follows: initially a pebble is placed on $v_0 := s$. In each time step $T \geq 0$, the pebble
is on some state $v_T$ and each player $i$ selects (simultaneously and independently of the other players, like in the game
rock-paper-scissors) an action $a^i_{T+1} \in A^i_{v_T}$. Then, the game selects $v_{T+1}$ according to the probability distribution
$\delta(v_T, a^1_{T+1}, a^2_{T+1}, \ldots, a^k_{T+1})$ and moves the pebble onto $v_{T+1}$. The game then continues with time step $T + 1$ (i.e.,
the game consists of infinitely many time steps). For a round $T$, let $a_{T+1}$ be the vector of choices of the actions for
the players, i.e., $(a_{T+1})_i$ is the choice of player $i$, for each $i$. Round 0 is identified by $v_0$ and round $T > 0$ is then
identified by the pair $(a_T, v_T)$. A play $P_s$, starting in state $v_0 = s$, is then a sequence of rounds

\[ (v_0, (a_1, v_1), (a_2, v_2), \ldots, (a_T, v_T), \ldots) , \]

and for each $\ell$ a prefix of $P_s$ of length $\ell$ is then

\[ P_s^\ell = (v_0, (a_1, v_1), (a_2, v_2), \ldots, (a_T, v_T), \ldots, (a_\ell, v_\ell)) , \]

and we say that $P_s^\ell$ ends in $v_\ell$. For each $i$, player $i$ wins in the play $P_s$ if $t_i = \text{Safety}$ and $v_T \in S^i$ for all $T \geq 0$;
or if $t_i = \text{Reach}$ and $v_T \in S^i$ for some $T \geq 0$. Otherwise, player $i$ loses. For each $i$, player $i$ tries to maximize the
probability that he wins.
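
The mechanics of a play can be illustrated by sampling a finite prefix. The sketch below assumes stationary strategies given as dictionaries and a small hypothetical two-player game in which the pebble moves from "s" to the absorbing state "goal" exactly when both players pick the same action; none of these names come from the paper.

```python
import random

# Hypothetical game: matching actions in "s" lead to the absorbing "goal".
delta = {
    ("s", ("a", "a")): {"goal": 1.0},
    ("s", ("b", "b")): {"goal": 1.0},
    ("s", ("a", "b")): {"s": 1.0},
    ("s", ("b", "a")): {"s": 1.0},
    ("goal", ("a", "a")): {"goal": 1.0},
}
uniform = {"s": {"a": 0.5, "b": 0.5}, "goal": {"a": 1.0}}
strategies = [uniform, uniform]  # one stationary strategy per player


def draw(dist, rng):
    """Sample one outcome from a dict {outcome: probability}."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample_prefix(delta, strategies, start, length, rng):
    """Prefix (v_0, (a_1, v_1), ..., (a_length, v_length)) of a play."""
    prefix, v = [start], start
    for _ in range(length):
        # All players choose simultaneously and independently.
        actions = tuple(draw(sigma[v], rng) for sigma in strategies)
        v = draw(delta[(v, actions)], rng)
        prefix.append((actions, v))
    return prefix


def states_of(prefix):
    return [prefix[0]] + [v for _, v in prefix[1:]]


def satisfied_on_prefix(kind, target, prefix):
    """Reach: target already visited. Safety: target not yet left."""
    visited = states_of(prefix)
    if kind == "Reach":
        return any(v in target for v in visited)
    return all(v in target for v in visited)


if __name__ == "__main__":
    prefix = sample_prefix(delta, strategies, "s", 10, random.Random(0))
    print(states_of(prefix))
    print(satisfied_on_prefix("Reach", {"goal"}, prefix))
```

A finite prefix settles a reachability objective once the target has been visited, but for a safety objective it can only show that the objective has not been violated so far.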

Strategies. Fix a player $i$. A strategy is a recipe to choose a probability distribution over actions given a finite prefix of
a play. Formally, a strategy $\sigma_i$ for player $i$ is a map from $P_s^\ell$, for a play $P_s$ of length $\ell$ starting at state $s$, to a distribution
over $A^i_{v_\ell}$. Player $i$ follows a strategy $\sigma_i$ if, given that the current prefix of a play is $P_s^\ell$, he selects $a^i_{\ell+1}$ according to $\sigma_i(P_s^\ell)$,
for all plays $P_s$ starting at $s$ and all lengths $\ell$. A strategy $\sigma_i$ for player $i$ is stationary if for all $\ell$ and $\ell'$, and all pairs of
plays $P_s$ and $P'_{s'}$, starting at states $s$ and $s'$ respectively, such that $P_s^\ell$ and $(P')_{s'}^{\ell'}$ end in the same state $t$, we have that
$\sigma_i(P_s^\ell) = \sigma_i((P')_{s'}^{\ell'})$; and we write $\sigma_i(t)$ for the unique distribution used for prefixes of plays ending in $t$. The patience
(resp., roundedness) of a strategy $\sigma_i$ is the supremum of the patience (resp., roundedness) of the distribution $\sigma_i(P_s^\ell)$,
over all plays $P_s$ starting at state $s$, and all lengths $\ell$. Also, a strategy $\sigma_i$ is pure (resp., totally mixed) if $\sigma_i(P_s^\ell)$ is pure
(resp., totally mixed), for all plays $P_s$ starting at $s$ and all lengths $\ell$. A strategy is positional if it is pure and stationary.
For each player $i$, let $\Sigma^i$ be the set of all strategies for the respective player.

Strategy profiles and Nash equilibria. A strategy profile $\sigma = (\sigma_i)_i$ is a vector of strategies, one for each player. A
strategy profile $\sigma$ defines a unique probability measure on plays, denoted $\Pr^{\sigma}$, when the players follow their respective
strategies 1^ . Let $u(G, s, \sigma, i)$ be the probability that player $i$ wins the game $G$ when the players follow $\sigma$ and the
play starts in $s$ (i.e., the utility or payoff for player $i$). Given a strategy profile $\sigma = (\sigma_i)_i$ and a strategy $\sigma'_i$ for player $i$,
the strategy profile $\sigma[\sigma'_i]$ is the strategy profile where the strategy for player $i$ is $\sigma'_i$ and the strategy for player $j$ is $\sigma_j$
for $j \neq i$. Fix a state $s$ and ε ≥ 0. A strategy profile $\sigma$ forms an ε-Nash equilibrium from state $s$ if for all $i$ and
all strategies $\sigma'_i$ for player $i$, we have that $u(G, s, \sigma, i) \geq u(G, s, \sigma[\sigma'_i], i) - \varepsilon$. A strategy profile $\sigma$ forms an ε-Nash
equilibrium if it forms an ε-Nash equilibrium from all states $s$. Also a strategy profile forms a Nash equilibrium (resp.,
from state $s$, for some $s$) if it forms a 0-Nash equilibrium (resp., from state $s$). We say that a strategy profile has a
property (e.g., is stationary) if each of the strategies in the profile has that property.
2.1 Zero-sum concurrent stochastic games

A zero-sum game consists of two players with complementary objectives. Since we only consider reachability and 
safety objectives, a zero-sum concurrent stochastic game consists of a two-player concurrent stochastic game with 
reachability objective for player 1 and the complementary safety objective for player 2 (such a game is also referred 
to as concurrent reachability game). 

Concurrent reachability game. A concurrent reachability game is a concurrent game with two players, identified by
$(\text{Reach}, S^1)$ and $(\text{Safety}, S \setminus S^1)$. Observe that in such games, exactly one player wins each play (this implies that
the games are zero-sum). Note that for all strategy profiles $\sigma$ we have $u(G, s, \sigma, 1) + u(G, s, \sigma, 2) = 1$. For ease of
notation and tradition, we write $u(G, s, \sigma_1, \sigma_2)$ for $u(G, s, (\sigma_1, \sigma_2), 1)$, for all concurrent reachability games $G$, states
$s$, and strategy profiles $\sigma = (\sigma_1, \sigma_2)$. Also if the game $G$ is clear from context we drop it from the notation.

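Values of concurrent reachability games. Given a concurrent reachability game $G$, the lower value of $G$ starting in
$s$ is

\[ \underline{\mathrm{val}}(G, s) = \sup_{\sigma_1 \in \Sigma^1} \inf_{\sigma_2 \in \Sigma^2} u(G, s, \sigma_1, \sigma_2) ; \]

and the upper value of $G$ starting in $s$ is

\[ \overline{\mathrm{val}}(G, s) = \inf_{\sigma_2 \in \Sigma^2} \sup_{\sigma_1 \in \Sigma^1} u(G, s, \sigma_1, \sigma_2) . \]

As shown by m we have that

\[ \mathrm{val}(G, s) := \overline{\mathrm{val}}(G, s) = \underline{\mathrm{val}}(G, s) ; \]

and this common number is called the value of $s$. We will sometimes write $\mathrm{val}(s)$ for $\mathrm{val}(G, s)$ if $G$ is clear from the
context. We will also write $\mathrm{val}$ for the vector where $\mathrm{val}_s = \mathrm{val}(s)$.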

(ε-)optimal strategies for concurrent reachability games. For an ε ≥ 0, a strategy $\sigma_1$ for player 1 (resp., $\sigma_2$ for
player 2) is called ε-optimal if for each state $s$ we have that $\mathrm{val}(s) - \varepsilon \leq \inf_{\sigma_2 \in \Sigma^2} u(s, \sigma_1, \sigma_2)$ (resp.,
$\mathrm{val}(s) + \varepsilon \geq \sup_{\sigma_1 \in \Sigma^1} u(s, \sigma_1, \sigma_2)$). For each $i$, a strategy $\sigma_i$ for player $i$ is called optimal if it is 0-optimal. There exist concurrent
reachability games in which player 1 does not have optimal strategies, see ifThl for an example². On the other hand
in all games $G$ player 1 has a stationary ε-optimal strategy for each ε > 0. In all games player 2 has an optimal
stationary strategy (thus also an ε-optimal stationary strategy for all ε > 0) ll^l2^ . Also, given a stationary strategy
$\sigma_1$ for player 1 we have that there exists a positional strategy $\sigma_2$ such that $u(s, \sigma_1, \sigma_2) = \inf_{\sigma'_2 \in \Sigma^2} u(s, \sigma_1, \sigma'_2)$, i.e.,
we only need to consider positional strategies for player 2. Similarly, we only need to consider positional strategies
for player 1, if we are given a stationary strategy for player 2.

(ε-)optimal strategies compared to (ε-)Nash equilibria. It is well-known and easy to see that for concurrent reachability 
games, a strategy profile $\sigma = (\sigma_1, \sigma_2)$ is optimal if and only if $\sigma$ forms a Nash equilibrium. Also, if $\sigma_1$ is 
ε-optimal and $\sigma_2$ is ε'-optimal, for some $\varepsilon$ and $\varepsilon'$, then $\sigma = (\sigma_1, \sigma_2)$ forms an $(\varepsilon + \varepsilon')$-Nash equilibrium. Furthermore, 
if $\sigma = (\sigma_1, \sigma_2)$ forms an ε-Nash equilibrium, for some $\varepsilon$, then $\sigma_1$ and $\sigma_2$ are ε-optimal^2. 

Markov decision processes and Markov chains. For each player $i$, a Markov decision process (MDP) for player $i$ is 
a concurrent game where the size of $A^j_s$ is 1 for all $s$ and $j \neq i$. A Markov chain is an MDP for each player (that is, 
the size of $A^j_s$ is 1 for all $s$ and $j$). A closed recurrent set of a Markov chain $G$ is a maximal (i.e., no closed recurrent 
set is a subset of another) set $S' \subseteq S$ such that for all pairs of states $s, s' \in S'$, the play starting at $s$ reaches state $s'$ 
eventually with probability 1 (note that it does not depend on the choices of the players as we have a Markov chain). 
For all starting states, eventually a closed recurrent set is reached with probability 1, and then plays stay in the closed 
recurrent set. Observe that fixing a stationary strategy for all but one player in a concurrent game, the resulting game 
is an MDP for the remaining player. Hence, fixing a stationary strategy for each player gives a Markov chain. 

^1 Note that this is not caused by requiring the strategy to be optimal for each start state: if there were an optimal strategy for each start state separately, then 
there would be one for all start states, since the strategies considered here are not restricted to stationary strategies. 

^2 Observe that the two latter properties imply the former, but all are included to make it clear that there is a strong connection. 
2.2 Matrix games and the value iteration algorithm 

A (two-player, zero-sum) matrix game consists of a matrix $M \in \mathbb{R}^{r \times c}$. We will typically let $M$ refer to both the 
matrix game and the matrix and it should be clear from the context what it means. A matrix game $M$ is played as 
follows: player 1 selects a row $a_1$ and at the same time, without knowing which row was selected by player 1, player 2 
selects a column $a_2$. The outcome is then $M_{a_1,a_2}$. Player 1 tries to maximize the outcome and player 2 tries to 
minimize it. 

Strategies in matrix games. A strategy $\sigma_1$ (resp., $\sigma_2$) for player 1 (resp., player 2) is a probability distribution over 
the rows (resp., columns) of $M$. A strategy profile $\sigma = (\sigma_1, \sigma_2)$ is a pair of strategies, one for each player. Given a 
strategy profile $\sigma = (\sigma_1, \sigma_2)$ the payoff $u(M, \sigma_1, \sigma_2)$ under those strategies is the expected outcome if player 1 picks 
row $a_1$ with probability $\sigma_1(a_1)$ and player 2 picks column $a_2$ with probability $\sigma_2(a_2)$ for each $a_1$ and $a_2$, i.e., 

$$u(M, \sigma_1, \sigma_2) = \sum_{a_1} \sum_{a_2} M_{a_1,a_2} \cdot \sigma_1(a_1) \cdot \sigma_2(a_2) .$$
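The payoff formula can be checked on a small instance. The following Python sketch is only an illustration: the function name `payoff`, the list-of-lists encoding of $M$ and the matching-pennies example are assumptions made here and do not come from the paper.

```python
from fractions import Fraction

def payoff(M, sigma1, sigma2):
    """Expected outcome u(M, sigma1, sigma2): sum of M[a1][a2] * sigma1[a1] * sigma2[a2]."""
    return sum(
        M[a1][a2] * sigma1[a1] * sigma2[a2]
        for a1 in range(len(M))
        for a2 in range(len(M[0]))
    )

# Hypothetical example: player 1 gets outcome 1 if the choices match, 0 otherwise.
M = [[Fraction(1), Fraction(0)],
     [Fraction(0), Fraction(1)]]
uniform = [Fraction(1, 2), Fraction(1, 2)]
pure_row_0 = [Fraction(1), Fraction(0)]
pure_col_1 = [Fraction(0), Fraction(1)]

print(payoff(M, uniform, uniform))        # uniform play by both players
print(payoff(M, pure_row_0, pure_col_1))  # a pure row against a pure column
```

Exact rationals (`Fraction`) are used so that equalities between payoffs can be tested without rounding issues.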


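Values in matrix games. The upper value of a matrix game is $\overline{\mathrm{val}}(M) = \sup_{\sigma_1} \inf_{\sigma_2} u(M, \sigma_1, \sigma_2)$. The lower value 
of a matrix game is $\underline{\mathrm{val}}(M) = \inf_{\sigma_2} \sup_{\sigma_1} u(M, \sigma_1, \sigma_2)$. One of the most fundamental results in game theory, as 
shown by iJTl , is that $\mathrm{val}(M) := \overline{\mathrm{val}}(M) = \underline{\mathrm{val}}(M)$. This common number is called the value. 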

(ε-)optimal strategies in matrix games. A strategy $\sigma_1$ for player 1 is ε-optimal, for some number $\varepsilon \geq 0$, if $\mathrm{val}(M) - 
\varepsilon \leq \inf_{\sigma_2} u(M, \sigma_1, \sigma_2)$. Similarly, a strategy $\sigma_2$ for player 2 is ε-optimal, for some number $\varepsilon \geq 0$, if $\mathrm{val}(M) + \varepsilon \geq 
\sup_{\sigma_1} u(M, \sigma_1, \sigma_2)$. A strategy is optimal if it is 0-optimal. There exists an optimal strategy for each player in all 
matrix games El. Given an optimal strategy $\sigma_1$ for player 1, consider the vector $v$, such that $v_j = u(M, \sigma_1, j)$ for 
each column $j$. Then we have that $v_j = \mathrm{val}(M)$ for each $j$ such that there exists an optimal strategy $\sigma_2$ for player 2, 
where $\sigma_2(j) > 0$. A similar analysis holds for optimal strategies of player 2. This also shows that given an optimal 
strategy $\sigma_1$ for player 1 we have that $u(M, \sigma_1, \sigma_2)$ is minimized for some pure strategy $\sigma_2$, and similarly for optimal 
strategies $\sigma_2$ for player 2. Given a matrix game $M$, an optimal strategy for each player and the value of $M$ can be 
computed in polynomial time using linear programming. 

The matrix game $A^s[v]$ and $A^s$. Fix a concurrent reachability game $G$. Given a vector $v$ in $\mathbb{R}^S$ and a state $s$ (in $G$), 
the matrix game $A^s[v] = [a_{i,j}]$ is the matrix game where $a_{i,j} = \sum_{s' \in S} \delta(s, i, j)(s') \cdot v_{s'}$. Given a state $s$, the matrix 
game $A^s$ is the matrix game $A^s[\mathrm{val}]$. As shown by ll^l23l , each optimal stationary strategy $\sigma_2$ for player 2 in $G$ is 
such that for each state $s$ the distribution $\sigma_2(s)$ is an optimal strategy in the matrix game $A^s$. Also, conversely, if $\sigma_2(s)$ 
is an optimal strategy in $A^s$ for each $s$, then $\sigma_2$ is an optimal stationary strategy in $G$. Furthermore, also as shown 
by ll^l23l , we have that $\mathrm{val}(s) = \mathrm{val}(A^s)$ for each state $s$. 

The value iteration algorithm. The conceptually simplest algorithm for concurrent reachability games is the value 
iteration algorithm, which is an iterative approximation algorithm. The idea is as follows: Given a concurrent reachability 
game $G$, consider the game $G^t$ where a time-limit $t$ (some non-negative integer) has been introduced. The game 
$G^t$ is then played as $G$, except that player 2 wins if the time-limit is exceeded (i.e., he wins after round $t$ unless a state 
in $S^1$ has been reached before that). (The game $G^t$ has a value, as in the above definition for matrix games, since the 
game only has a finite number of pure strategies and thus can be reduced to a matrix game.) The value of $G^t$ starting 
in state $s$ then converges to the value of $G$ starting in $s$ as $t$ goes to infinity, as shown by ifTbll . More precisely, the 
algorithm is defined on a vector $v^t$, which is the vector where $v^t_s$ is the value of $G^t$ starting in $s$. We can compute $v^t_s$ 
recursively for increasing $t$ as follows 

$$v^t_s = \begin{cases} 1 & \text{if } s \in S^1 \\ 0 & \text{if } s \notin S^1 \text{ and } t = 0 \\ \mathrm{val}(A^s[v^{t-1}]) & \text{if } s \notin S^1 \text{ and } t \geq 1 . \end{cases}$$

We have that $v^t_s \leq v^{t+1}_s \leq \mathrm{val}(s)$ for all $t$ and $s$, and for all $s$ we have $\lim_{t \to \infty} v^t_s = \mathrm{val}(s)$, as shown by ifTbl. As 

shown by ll20ll2T1l the smallest time-limit t such that F* > val(s) — e can be as large as ^ for some games (of 
N states and at most m actions in each state for each player) and s, for e > 0. On the other hand it is also at most 
shown by ll20l . 
3 Zero-sum Concurrent Stochastic Games: Patience Lower Bound 


In this section we will establish the doubly-exponential lower bound on patience for zero-sum concurrent stochastic 
games. First we define the game family, namely, Purgatory Duel and we also recall the family Purgatory that will 
be used in our proofs. We split our proof about the patience in Purgatory Duel in three parts. First we present some 
refined analysis of matrix games, and use the analysis to first prove the lower bound for optimal strategies, and then 
for ε-optimal strategies, for $\varepsilon > 0$. 

The Purgatory Duel. In this paper we specifically focus on the following concurrent reachability game, the 
Purgatory Duel^3, defined on a pair of parameters $(n, m)$. The game consists of $N = 2n + 3$ states, namely 
$\{v^1_1, v^1_2, \ldots, v^1_n, v^2_1, v^2_2, \ldots, v^2_n, v_s, \top, \bot\}$ and all but $v_s$ are deterministic. To simplify the definition of the game, 
let $v^1_0 = v^2_{n+1} = \bot$ and $v^2_0 = v^1_{n+1} = \top$. The states $\top$ and $\bot$ are absorbing. For each $i \in \{1, 2\}$ and $j \in \{1, \ldots, n\}$, 
the state $v^i_j$ is such that $A^1_{v^i_j} = A^2_{v^i_j} = \{1, 2, \ldots, m\}$ and for each $a_1, a_2$ we have that 

$$\delta(v^i_j, a_1, a_2) = \begin{cases} v_s & \text{if } a_1 > a_2 \\ v^i_0 & \text{if } a_1 < a_2 \\ v^i_{j+1} & \text{if } a_1 = a_2 . \end{cases}$$

Finally, $A^1_{v_s} = A^2_{v_s} = \{a\}$ and $\delta(v_s, a, a) = U(v^1_1, v^2_1)$. Furthermore, $S^1 = \{\top\}$. There is an illustration of the 
Purgatory Duel with $m = n = 2$ in Figure 1. 

The game Purgatory. We will also use the game Purgatory as defined by [20] (and also in [23] for the case of 
$m = 2$). Purgatory is similar to the Purgatory Duel and hence the similarity in names. Purgatory is also defined on a 
pair of parameters $(n, m)$. The game consists of $N = n + 2$ states, namely, $\{v_1, v_2, \ldots, v_n, \top, \bot\}$ and each state is 
deterministic. To simplify the definition of the game, let $v_{n+1} = \top$. For each $j \in \{1, \ldots, n\}$, the state $v_j$ is such that 
$A^1_{v_j} = A^2_{v_j} = \{1, 2, \ldots, m\}$ and for each $a_1, a_2$ we have that 

$$\delta(v_j, a_1, a_2) = \begin{cases} v_1 & \text{if } a_1 > a_2 \\ \bot & \text{if } a_1 < a_2 \\ v_{j+1} & \text{if } a_1 = a_2 . \end{cases}$$

The states $\top$ and $\bot$ are absorbing. Furthermore, $S^1 = \{\top\}$. There is an illustration of Purgatory with $m = n = 2$ in 
Figure 2. 
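The value iteration recursion of Section 2.2 can be run by hand on Purgatory when $m = 2$, because every matrix game $A^{v_j}[v^{t-1}]$ is then a 2 x 2 game. The Python sketch below is a minimal illustration under stated assumptions: the helper names `value_2x2` and `purgatory_value_iteration` are hypothetical, and `value_2x2` uses the standard closed form for 2 x 2 zero-sum games (pure saddle point if one exists, otherwise $(ad - bc)/(a + d - b - c)$) instead of linear programming.

```python
from fractions import Fraction

def value_2x2(a, b, c, d):
    """Value of the zero-sum matrix game [[a, b], [c, d]]; the row player maximizes."""
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                      # a pure saddle point exists
        return maximin
    return (a * d - b * c) / (a + d - b - c)    # both players mix

def purgatory_value_iteration(n, horizon):
    """Return [v^0, ..., v^horizon] for Purgatory with parameters (n, 2).

    Each v^t is the list of time-limited values of v_1, ..., v_n.
    """
    top, bottom = Fraction(1), Fraction(0)
    v = [Fraction(0)] * n                       # t = 0: non-target states have value 0
    history = [v]
    for _ in range(horizon):
        nxt = []
        for j in range(n):
            succ = v[j + 1] if j + 1 < n else top   # previous value of v_{j+1}
            # Rows are a1, columns are a2:
            # a1 = a2 -> v_{j+1};  a1 < a2 -> bottom;  a1 > a2 -> v_1
            nxt.append(value_2x2(succ, bottom, v[0], succ))
        v = nxt
        history.append(v)
    return history

for t, values in enumerate(purgatory_value_iteration(1, 5)):
    print(t, values[0])
```

For $n = 1$ the recursion reduces to $v^{t} = 1/(2 - v^{t-1})$, so the iterates approach 1 only at rate $1/(t+1)$; this is a small-scale view of why the number of iterations needed for a given precision can be very large.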


3.1 Analysis of matrix games 


In this section we present some refined analysis of some simple matrix games, which we use in the later sections to 
find optimal strategies for the players and the values of the states in the Purgatory Duel. 

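Definition 1. Given a positive integer $m$ and reals $x$, $y$ and $z$, let $M^{x,y,z,m}$ be the $(m \times m)$-matrix with $x$ below the 
diagonal, $y$ in the diagonal and $z$ above the diagonal, i.e., 

$$M^{x,y,z,m} = \begin{pmatrix} y & z & z & \cdots & z \\ x & y & z & \cdots & z \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ x & \cdots & x & y & z \\ x & x & \cdots & x & y \end{pmatrix}$$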

We first explain the significance of the matrix game $M^{x,y,z,m}$ in relation to the Purgatory Duel. Consider the Purgatory 
Duel defined on parameters $(n, m)$, for some $n$. We will later establish that for any $j$, if $v$ (resp., $v'$) is the state $v^1_j$ (resp., 
$v^2_j$) of the Purgatory Duel, then we have that $A^v = M^{0,\mathrm{val}(v^1_{j+1}),\mathrm{val}(v_s),m}$ (resp., $A^{v'} = M^{1,\mathrm{val}(v^2_{j+1}),\mathrm{val}(v_s),m}$). In 
this section we show that for $0 < z < y$ we have that $M = M^{0,y,z,m}$ is such that $\mathrm{val}(M) > z$ and each optimal 
strategy for either player is totally mixed. Similarly, for $1 > z' > y'$ we show that $M' = M^{1,y',z',m}$ is such that 
$\mathrm{val}(M') < z'$ and each optimal strategy for either player is totally mixed. We also compute the value and the patience 
of each optimal strategy in the matrix game $M^{0,\frac{1}{2}+\varepsilon,\frac{1}{2},m}$ (since we will establish in the next section, using the results 
of this section, that $\mathrm{val}(v_s) = \frac{1}{2}$ and $\mathrm{val}(v^1_j) > \mathrm{val}(v_s)$ for all $j$). 

^3 To allow a more compact notation, we have here exchanged the criteria for when the safety player wins as a guard and when the escape attempt 
ends, as compared to the textual description of the game given in the introduction. 

Figure 1: An illustration of the Purgatory Duel with $m = n = 2$. The two dashed edges have probability $\frac{1}{2}$ each. 

Figure 2: An illustration of Purgatory with $m = n = 2$. 


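Lemma 2. For all positive integers $m$ and reals $y$ and $z$ such that $0 < z < y$, the matrix game $M = M^{0,y,z,m}$ has 
value strictly above $z$. 

Proof. Let $\varepsilon > 0$ be some number to be defined later. Consider the probability distribution $\sigma_1^{\varepsilon}$ given by 

$$\sigma_1^{\varepsilon}(a) = \begin{cases} \varepsilon^{a-1} - \varepsilon^{a} & \text{if } 1 \leq a \leq m - 1 \\ \varepsilon^{m-1} & \text{if } a = m . \end{cases}$$

If player 2 plays column $a$ against $\sigma_1^{\varepsilon}$, for $a \leq m - 1$, then the payoff $u(M, \sigma_1^{\varepsilon}, a)$ is $y \cdot (\varepsilon^{a-1} - \varepsilon^{a}) + z \cdot (1 - \varepsilon^{a-1})$, 
and if player 2 plays column $m$, then the payoff $u(M, \sigma_1^{\varepsilon}, m)$ is $y \cdot \varepsilon^{m-1} + z \cdot (1 - \varepsilon^{m-1})$. For any $\varepsilon$ such that 
$y \cdot (1 - \varepsilon) > z$, the payoff is strictly greater than $z$ in both cases, implying that the value of $M$ is strictly greater than $z$. □ 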
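The argument of Lemma 2 can be replayed numerically. The sketch below is illustrative only: the helper names (`band_matrix`, `lemma2_strategy`, `column_payoffs`) and the concrete numbers $y = 3/4$, $z = 1/2$, $m = 4$, $\varepsilon = 1/4$ are assumptions chosen here so that $y \cdot (1 - \varepsilon) > z$ holds. It builds the matrix with 0 below the diagonal, $y$ on it and $z$ above it, and evaluates the strategy from the proof against every pure column.

```python
from fractions import Fraction

def band_matrix(x, y, z, m):
    """The m x m matrix with x below the diagonal, y on it and z above it."""
    return [[x if r > c else y if r == c else z for c in range(m)]
            for r in range(m)]

def lemma2_strategy(eps, m):
    """Weight eps^(a-1) - eps^a on rows a = 1..m-1 and eps^(m-1) on row m."""
    return [eps ** a - eps ** (a + 1) for a in range(m - 1)] + [eps ** (m - 1)]

def column_payoffs(M, sigma1):
    """Expected outcome of sigma1 against each pure column of M."""
    m = len(M)
    return [sum(sigma1[r] * M[r][c] for r in range(m)) for c in range(m)]

y, z, m = Fraction(3, 4), Fraction(1, 2), 4
eps = Fraction(1, 4)                      # chosen so that y * (1 - eps) > z
M = band_matrix(Fraction(0), y, z, m)
sigma1 = lemma2_strategy(eps, m)
payoffs = column_payoffs(M, sigma1)
print(payoffs)                            # every entry should exceed z
```

Since every column payoff exceeds $z$, the strategy guarantees more than $z$, which is exactly the claim of the lemma for this instance.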

Lemma 3. For all positive integers $m$ and reals $y$ and $z$ such that $0 < z < y$, each optimal strategy for player 1 in 
the matrix game $M = M^{0,y,z,m}$ is totally mixed. 

Proof. Consider some strategy $\sigma_1$ for player 1 in $M$ which is not totally mixed. Thus there exists some row $a$, 
where $\sigma_1(a) = 0$. Consider the pure strategy $\sigma_2$ that plays column $a$ with probability 1. Playing $\sigma_1$ against $\sigma_2$ ensures 
that each outcome is either $z$ or 0, i.e., the payoff is at most $z$, which is strictly less than the value by Lemma 2. □ 

Lemma 4. For all positive integers $m$ and reals $y$ and $z$ such that $0 < z < y$, each optimal strategy for player 2 in 
the matrix game $M = M^{0,y,z,m}$ is totally mixed. 

Proof. Given a strategy $\sigma_1$ for player 1 and two rows $a'$ and $a''$, let the strategy $\sigma_1[a' \to a'']$ be the strategy where the 
probability mass on $a'$ is moved to $a''$, i.e., 

$$\sigma_1[a' \to a''](a) = \begin{cases} \sigma_1(a) & \text{if } a \notin \{a', a''\} \\ 0 & \text{if } a = a' \\ \sigma_1(a') + \sigma_1(a'') & \text{if } a = a'' . \end{cases}$$

Consider some optimal totally mixed strategy $\sigma_1$ for player 1, which exists by Lemma 3, and let $v$ be the value of 
$M$. Consider some strategy $\sigma_2$ for player 2 such that $u(M, \sigma_1, \sigma_2) = v$, but $\sigma_2$ is not totally mixed. We will argue that 
$\sigma_2$ is not optimal. This shows that any optimal strategy is totally mixed, since any optimal strategy $\sigma_2$ is such that 
$u(M, \sigma_1, \sigma_2) = v$. 

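Let $b'$ be the first column such that $\sigma_2(b') = 0$. There are two cases, either $b' = 1$ or $b' > 1$. If $b' = 1$, let $b''$ be the 
first action such that $\sigma_2(b'') > 0$. Let $\sigma'_1 = \sigma_1[b' \to b'']$. The payoff $u(M, \sigma'_1, \sigma_2)$ of playing $\sigma'_1$ against $\sigma_2$ is strictly 
more than the payoff $u(M, \sigma_1, \sigma_2)$ of playing $\sigma_1$ against $\sigma_2$. This is because the payoff $u(M, \sigma'_1, b'')$ is such that 

$$\begin{aligned} u(M, \sigma'_1, b'') &= \sigma'_1(b'') \cdot y + z \cdot \sum_{a=1}^{b''-1} \sigma'_1(a) \\ &= \sigma'_1(b'') \cdot y + z \cdot \sum_{a=2}^{b''-1} \sigma'_1(a) \\ &= (\sigma_1(b'') + \sigma_1(1)) \cdot y + z \cdot \sum_{a=2}^{b''-1} \sigma_1(a) \\ &> \sigma_1(b'') \cdot y + z \cdot \sum_{a=1}^{b''-1} \sigma_1(a) \\ &= u(M, \sigma_1, b'') , \end{aligned}$$

where the second equality comes from $\sigma'_1(1) = 0$. The inequality comes from $y > z$ and $\sigma_1(1) > 0$. Also, the payoff 
$u(M, \sigma'_1, b)$, for $b > b''$, is such that 

$$u(M, \sigma'_1, b) = \sigma'_1(b) \cdot y + z \cdot \sum_{a=1}^{b-1} \sigma'_1(a) = \sigma_1(b) \cdot y + z \cdot \sum_{a=1}^{b-1} \sigma_1(a) = u(M, \sigma_1, b) ,$$

because $\sigma'_1(b) = \sigma_1(b)$ and the total probability mass on the rows $1, \ldots, b - 1$ is the same under $\sigma'_1$ and $\sigma_1$. We can then find the payoff $u(M, \sigma'_1, \sigma_2)$ as follows 

$$\begin{aligned} u(M, \sigma'_1, \sigma_2) &= \sum_{b=1}^{m} \sigma_2(b) \cdot u(M, \sigma'_1, b) \\ &= \sum_{b=b''}^{m} \sigma_2(b) \cdot u(M, \sigma'_1, b) \\ &= \sigma_2(b'') \cdot u(M, \sigma'_1, b'') + \sum_{b=b''+1}^{m} \sigma_2(b) \cdot u(M, \sigma'_1, b) \\ &> \sigma_2(b'') \cdot u(M, \sigma_1, b'') + \sum_{b=b''+1}^{m} \sigma_2(b) \cdot u(M, \sigma_1, b) \\ &= u(M, \sigma_1, \sigma_2) , \end{aligned}$$

where the second equality comes from that $b''$ is the first action $\sigma_2$ plays with positive probability. Since the payoff 
$u(M, \sigma_1, \sigma_2)$ is the value, by definition of $\sigma_2$, and the payoff $u(M, \sigma'_1, \sigma_2)$ is strictly more, the strategy $\sigma_2$ cannot be 
optimal. This completes the case where $b' = 1$. 

The case where $b' \neq 1$ follows similarly but considers $\sigma''_1 = \sigma_1[b' \to 1]$ instead of $\sigma'_1$. □ 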

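Lemma 5. For all positive integers $m$ and $0 < \varepsilon \leq \frac{1}{2}$, the matrix game $M = M^{0,\frac{1}{2}+\varepsilon,\frac{1}{2},m}$ has the following 
properties: 

• Property 1. The patience of any optimal strategy is (i) at least $(2\varepsilon)^{-m+1}$ and (ii) decreasing in $\varepsilon$. 

• Property 2. The value is (i) at most $\frac{1}{2} + \varepsilon \cdot (2\varepsilon)^{m-1}$ and (ii) increasing in $\varepsilon$. 

• Property 3. Any optimal strategy $\sigma_1$ for player 1 (resp., $\sigma_2$ for player 2) is such that $\sigma_1(1) > \frac{1}{2}$ (resp., 
$\sigma_2(m) > \frac{1}{2}$). 

• Property 4. For $\varepsilon = \frac{1}{2}$, the value is $\mathrm{val}(M) = \frac{1}{2} + \frac{1}{2^{m+1} - 2}$ and the patience of any optimal strategy is $2^m - 1$. 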

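Proof. Let $\sigma_i$ be an optimal strategy for player $i$ in $M$, for each $i$. By Lemma 3 and Lemma 4 the strategy $\sigma_i$ is totally 
mixed for each $i$. We can therefore consider the vector $v$. Recall that $v_j = u(M, \sigma_1, j)$ and that for each $j$ such that 
$\sigma_2(j) > 0$ we have that $v_j = \mathrm{val}(M)$. Hence, since $\sigma_2$ is totally mixed, all entries of $v$ are $\mathrm{val}(M)$. For any column 
$a' < m$, that $v_{a'} = v_{a'+1}$ implies that 

$$\begin{aligned} \left(\tfrac{1}{2} + \varepsilon\right) \cdot \sigma_1(a') + \tfrac{1}{2} \cdot \sum_{a=1}^{a'-1} \sigma_1(a) &= \left(\tfrac{1}{2} + \varepsilon\right) \cdot \sigma_1(a' + 1) + \tfrac{1}{2} \cdot \sum_{a=1}^{a'} \sigma_1(a) \\ \Rightarrow \quad \varepsilon \cdot \sigma_1(a') &= \left(\tfrac{1}{2} + \varepsilon\right) \cdot \sigma_1(a' + 1) \\ \Rightarrow \quad \sigma_1(a') &= \frac{\tfrac{1}{2} + \varepsilon}{\varepsilon} \cdot \sigma_1(a' + 1) , \end{aligned}$$

indicating that $\sigma_1(a') > \sigma_1(a' + 1)$ and thus the patience is $1/\sigma_1(m)$. Also, since $\sigma_1$ is a probability distribution 

$$1 = \sum_{a=1}^{m} \sigma_1(a) = \sigma_1(m) \cdot \sum_{a=1}^{m} \left(\frac{\tfrac{1}{2} + \varepsilon}{\varepsilon}\right)^{m-a} .$$

We then get that 

$$\sigma_1(m) = \frac{1}{\sum_{a=1}^{m} \left(\frac{\frac{1}{2} + \varepsilon}{\varepsilon}\right)^{m-a}} .$$

We have that $\frac{\frac{1}{2} + \varepsilon}{\varepsilon}$ is decreasing in $\varepsilon$. This indicates that $\sigma_1(m)$ is increasing in $\varepsilon$ and thus the patience is 
decreasing in $\varepsilon$. This shows (ii) of Property 1 for player 1. We also have that $\mathrm{val}(M) = v_m$ indicating that 

$$\begin{aligned} \mathrm{val}(M) &= v_m \\ &= \sigma_1(m) \cdot \left(\tfrac{1}{2} + \varepsilon\right) + \tfrac{1}{2} \cdot \sum_{a=1}^{m-1} \sigma_1(a) \\ &= \varepsilon \cdot \sigma_1(m) + \tfrac{1}{2} \end{aligned}$$

and thus, the value is increasing in $\varepsilon$ (because $\varepsilon$ and $\sigma_1(m)$ both are). This shows (ii) of Property 2. 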
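Also, we get that 

$$\frac{1}{\sigma_1(m)} = \sum_{a=1}^{m} \left(\frac{\tfrac{1}{2} + \varepsilon}{\varepsilon}\right)^{m-a} = \frac{\sum_{a=1}^{m} \left(\tfrac{1}{2} + \varepsilon\right)^{m-a} \cdot \varepsilon^{a-1}}{\varepsilon^{m-1}} = \frac{\left(\tfrac{1}{2}\right)^{m-1} + \varepsilon \cdot p(\varepsilon)}{\varepsilon^{m-1}} ,$$

where $p$ is some polynomial of degree at most $m - 1$ in which all terms have a positive sign ($p$ is found by multiplying out 
$\sum_{a=1}^{m} (\frac{1}{2} + \varepsilon)^{m-a} \cdot \varepsilon^{a-1}$ and separating the constant term $(\frac{1}{2})^{m-1}$). Hence, we have that $\sigma_1(m)$ is at most 

$$\sigma_1(m) = \frac{\varepsilon^{m-1}}{\left(\tfrac{1}{2}\right)^{m-1} + \varepsilon \cdot p(\varepsilon)} \leq (2\varepsilon)^{m-1} .$$

Thus, the patience is at least $(2\varepsilon)^{-m+1}$. This shows (i) of Property 1 for player 1. Using that $\mathrm{val}(M) = \varepsilon \cdot \sigma_1(m) + \frac{1}{2}$ 
from above, we get that $\mathrm{val}(M) \leq \frac{1}{2} + \varepsilon \cdot (2\varepsilon)^{m-1}$. This shows (i) of Property 2. 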

Furthermore, we can also consider the vector $v'$ such that $v'_j = u(M, j, \sigma_2)$ for all $j$ (which like $v$ has all entries 
equal to $\mathrm{val}(M)$). Since the expression, when $\sigma_2$ is taken to be an unknown vector, for the $j$'th entry of $v'$ is the same 
as for the $(m + 1 - j)$'th entry of $v$, when $\sigma_1$ is taken to be an unknown vector, we see that $\sigma_1(a) = \sigma_2(m + 1 - a)$, 
implying that the patience of player 2's optimal strategies is also at least $(2\varepsilon)^{-m+1}$ and that it is decreasing in $\varepsilon$. This 
shows Property 1 for player 2. 

Observe that since the value is above $\frac{1}{2}$, by Lemma 2, we have that $\sigma_1(1) > \frac{1}{2}$ (because otherwise, if player 2 
plays 1 with probability 1, the payoff will not be above $\frac{1}{2}$) and thus also $\sigma_2(m) > \frac{1}{2}$. This shows Property 3. 

Also, for $\varepsilon = \frac{1}{2}$ we see that 

$$\sigma_1(m) = \frac{1}{\sum_{a=1}^{m} 2^{m-a}} = \frac{1}{2^m - 1} .$$

Similarly to above, we also get that $\sigma_2(1) = \frac{1}{2^m - 1}$ and $\mathrm{val}(M) = \frac{1}{2} + \frac{1}{2^{m+1} - 2}$. This shows Property 4 and 
completes the proof. □ 
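The closed form obtained in the proof of Lemma 5 is easy to evaluate exactly. The following Python sketch is an illustration under assumptions: the function names are hypothetical, and the code takes for granted the recurrence $\sigma_1(a') = \frac{1/2 + \varepsilon}{\varepsilon} \cdot \sigma_1(a' + 1)$ from the proof rather than solving a linear program. It checks that the resulting strategy makes player 2 indifferent between all columns and reports the value and the patience.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def optimal_row_strategy(eps, m):
    """Strategy with sigma_1(a) = ratio * sigma_1(a + 1), normalised to sum to 1."""
    ratio = (HALF + eps) / eps
    weights = [ratio ** (m - a) for a in range(1, m + 1)]   # relative to sigma_1(m)
    total = sum(weights)
    return [w / total for w in weights]

def column_payoffs(eps, m, sigma1):
    """Payoff against column c: (1/2 + eps) on the diagonal, 1/2 above it, 0 below it."""
    y, z = HALF + eps, HALF
    return [y * sigma1[c] + z * sum(sigma1[:c]) for c in range(m)]

def patience(sigma):
    """Inverse of the smallest non-zero probability."""
    return 1 / min(p for p in sigma if p > 0)

m = 5
sigma1 = optimal_row_strategy(HALF, m)
payoffs = column_payoffs(HALF, m, sigma1)
print(payoffs[0], patience(sigma1))   # value and patience for eps = 1/2
```

For $\varepsilon = \frac{1}{2}$ and $m = 5$ this gives patience $2^5 - 1 = 31$ and value $\frac{1}{2} + \frac{1}{62}$, in line with Property 4; smaller values of $\varepsilon$ can be passed to observe Properties 1 and 2.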

Lemma 6. Given a positive integer $m$ and reals $y$ and $z$ such that $1 > z > y$, the matrix game $M = M^{1,y,z,m}$ has 
the following properties: 

• The value $\mathrm{val}(M) < z$. 

• Each optimal strategy $\sigma_i$ for player $i$ is such that there exists an optimal strategy $\sigma'_{3-i}$ for the other player $3 - i$ in $M^{0,1-y,1-z,m}$ 
where $\sigma_i(j) = \sigma'_{3-i}(m - j + 1)$. 


Proof. Let a positive integer $m$ and reals $y$ and $z$ such that $1 > z > y$ be given. Consider $M$ and let $v$ be the value of 
$M$. Exchange the roles of the players by exchanging the rows and columns and multiply the matrix by $-1$. We get the 
matrix 

$$M^1 = \begin{pmatrix} -y & -1 & -1 & \cdots & -1 \\ -z & -y & -1 & \cdots & -1 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ -z & \cdots & -z & -y & -1 \\ -z & -z & \cdots & -z & -y \end{pmatrix}$$

We then have that each optimal strategy $\sigma_1$ for player 1 in $M$ is an optimal strategy for player 2 in $M^1$ and similarly, each optimal 
strategy $\sigma_2$ for player 2 in $M$ is an optimal strategy for player 1 in $M^1$ (and vice versa). Also, the value $v_1$ of $M^1$ is 
$v_1 := -v$. 

Let $M^2$ be the matrix where $M^2_{i,j} = M^1_{m+1-i,m+1-j}$, i.e., 

$$M^2 = \begin{pmatrix} -y & -z & -z & \cdots & -z \\ -1 & -y & -z & \cdots & -z \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ -1 & \cdots & -1 & -y & -z \\ -1 & -1 & \cdots & -1 & -y \end{pmatrix}$$

For each $i$, and for any optimal strategy $\sigma_i$ for player $i$ in $M^1$ the strategy $\sigma'_i$ is optimal for player $i$ in $M^2$, where 
$\sigma'_i(a) = \sigma_i(m + 1 - a)$ for each $a$ (and vice versa). Also, the value $v_2$ of $M^2$ is $v_2 := v_1 = -v$. 

Next, let $M^3$ be the matrix $M^2$ where we add 1 to each entry, i.e., 

$$M^3 = \begin{pmatrix} 1-y & 1-z & 1-z & \cdots & 1-z \\ 0 & 1-y & 1-z & \cdots & 1-z \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1-y & 1-z \\ 0 & 0 & \cdots & 0 & 1-y \end{pmatrix}$$

For each $i$, it is clear that an optimal strategy $\sigma_i$ for player $i$ in $M^2$ is an optimal strategy for player $i$ in $M^3$ and that 
the value $v_3$ is $v_3 := 1 + v_2 = 1 - v$. Also, we see that $M^3 = M^{0,1-y,1-z,m}$ and that $0 < 1 - z < 1 - y$. 

We then get that $1 - v > 1 - z$ from Lemma 2 and thus $v < z$. □ 
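The three transformations in the proof of Lemma 6 are purely mechanical, so the claimed identity between the final matrix and a matrix with 0 below the diagonal, $1 - y$ on it and $1 - z$ above it can be checked directly. The Python sketch below is illustrative; the helper names `band_matrix` and `lemma6_transform` and the sample numbers are assumptions made here.

```python
from fractions import Fraction

def band_matrix(x, y, z, m):
    """The m x m matrix with x below the diagonal, y on it and z above it."""
    return [[x if r > c else y if r == c else z for c in range(m)]
            for r in range(m)]

def lemma6_transform(M):
    """Apply the three steps of the proof: swap players and negate, reverse, add 1."""
    m = len(M)
    # Step 1: exchange rows and columns and multiply by -1.
    M1 = [[-M[c][r] for c in range(m)] for r in range(m)]
    # Step 2: reverse the order of the actions of both players.
    M2 = [[M1[m - 1 - r][m - 1 - c] for c in range(m)] for r in range(m)]
    # Step 3: add 1 to every entry.
    return [[entry + 1 for entry in row] for row in M2]

y, z, m = Fraction(1, 4), Fraction(1, 2), 4     # hypothetical values with 1 > z > y
M = band_matrix(Fraction(1), y, z, m)
M3 = lemma6_transform(M)
print(M3 == band_matrix(Fraction(0), 1 - y, 1 - z, m))
```

Because each step changes the value in a known way (negation, no change, shift by 1), the value $v$ of the original game and the value $v_3$ of the transformed game satisfy $v_3 = 1 - v$, which is how the proof transfers Lemma 2 to this setting.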


3.2 The patience of optimal strategies 

In this section we present an approximation of the values of the states and the patience of the optimal strategies in the 
Purgatory Duel. We first show that the values of the states (besides $\top$ and $\bot$) are strictly between 0 and 1. 

Lemma 7. Each state 

$$v \in \{v^1_1, v^1_2, \ldots, v^1_n, v^2_1, v^2_2, \ldots, v^2_n, v_s\}$$

is such that $\mathrm{val}(v) \in \left[\frac{1}{m^{n+2}}, 1 - \frac{1}{m^{n+2}}\right]$. 

Proof. Fix $v \in \{v^1_1, v^1_2, \ldots, v^1_n, v^2_1, v^2_2, \ldots, v^2_n, v_s\}$. The fact that $\mathrm{val}(v) \geq \frac{1}{m^{n+2}}$ follows from that if player 1 plays 
uniformly at random all actions in every state $v^i_j$ for all $i, j$, then against all strategies for player 2 there is a probability 
of at least $\frac{1}{m}$ to go (1) from $v^1_j$ to $v^1_{j+1}$, for all $j$; and (2) from $v_s$ to $v^1_1$; and (3) from $v^2_j$ to $v_s$, for all $j$. By following 
such steps for at most $n + 2$ steps, the state $v^1_{n+1} = \top$ is reached. Similarly, that $\mathrm{val}(v) \leq 1 - \frac{1}{m^{n+2}}$ follows from 
player 2 playing uniformly at random all actions in every state $v^i_j$ for all $i, j$ (and using that $\top$ cannot be reached from 
$\bot$). □ 


Next we show that every optimal stationary strategy for player 2 must be totally mixed. 

Lemma 8. Let $\sigma_2$ be an optimal stationary strategy for player 2. The distribution $\sigma_2(v^i_j)$ is totally mixed and $\mathrm{val}(v^1_j) > 
\mathrm{val}(v_s) > \mathrm{val}(v^2_j)$, for all $i, j$. 

Proof. Let $v = v^i_j$ for some $i, j$. We will use that $\mathrm{val}(v) = \mathrm{val}(A^v)$. For $i = 1$ we have that $A^v = 
M^{0,\mathrm{val}(v^1_{j+1}),\mathrm{val}(v_s),m}$ and for $i = 2$ we have that $A^v = M^{1,\mathrm{val}(v^2_{j+1}),\mathrm{val}(v_s),m}$. 

Consider first $i = 1$. We will show using induction in $j$ (with base case $j = n$ and proceeding downwards), that 
$\mathrm{val}(v^1_j) > \mathrm{val}(v_s)$ and that the distribution $\sigma_2(v^1_j)$ is totally mixed. 

Base case, $j = n$: We have that $A^v = M^{0,1,\mathrm{val}(v_s),m}$. By Lemma 7 we have that $1 > \mathrm{val}(v_s) > 0$ and thus, that 
$\mathrm{val}(v) > \mathrm{val}(v_s)$ follows from Lemma 2. That $\sigma_2(v)$ is totally mixed follows from Lemma 4. 

Induction case, $j \leq n - 1$: We have that $A^v = M^{0,\mathrm{val}(v^1_{j+1}),\mathrm{val}(v_s),m}$. By Lemma 7 we have that $\mathrm{val}(v_s) > 0$ 
and by induction we have that $\mathrm{val}(v^1_{j+1}) > \mathrm{val}(v_s)$ and thus, that $\mathrm{val}(v) > \mathrm{val}(v_s)$ follows from Lemma 2. That $\sigma_2(v)$ 
is totally mixed follows from Lemma 4. 

The argument for $i = 2$ is similar but uses Lemma 6 together with Lemma 3 instead of Lemma 4 and Lemma 2. □ 

Next, we show that if either player follows a stationary strategy that is totally mixed on at least one side (that is, if 
there is an $i'$, such that for each $j$ the stationary strategy plays totally mixed in $v^{i'}_j$), then eventually either $\top$ or $\bot$ is 
reached with probability 1. 

Lemma 9. For any $i$ and $i'$, let $\sigma_i$ be a stationary strategy for player $i$, such that $\sigma_i(v^{i'}_j)$ is totally mixed for all $j$. Let 
$\sigma_{3-i}$ be some positional strategy for the other player. Then, each closed recurrent set in the Markov chain defined by the 
game and $\sigma_i$ and $\sigma_{3-i}$ consists of only the state $\top$ or only the state $\bot$. 

Proof. In the Markov chain defined by the game and $\sigma_i$ and $\sigma_{3-i}$, we have that there are at most two closed recurrent 
sets, namely, the one consisting of only $\top$ and the one consisting of only $\bot$. The reasoning is as follows: If either $\top$ 
or $\bot$ is reached, then the respective state will not be left. Also, for each $j$, since $\sigma_i$ is totally mixed there is a positive 
probability to go to either $v^{i'}_0$ or $v^{i'}_{j+1}$ from $v^{i'}_j$ (the remaining probability goes to $v_s$). The probability to go from $v_s$ 
to $v^{i'}_1$ in one step is $\frac{1}{2}$. Also if neither $\top$ nor $\bot$ has been reached, then $v_s$ is visited after at most $n + 1$ steps. Hence, 
in every $n + 1$ steps there is a positive probability that in the next $n + 1$ steps either $\top$ or $\bot$ is reached (i.e., from $v_s$ 
there is a positive probability that the next states are either (i) $v^{i'}_1, \ldots, v^{i'}_j, v^{i'}_0$; or (ii) $v^{i'}_1, \ldots, v^{i'}_{n+1}$). This shows 
that eventually either $\top$ or $\bot$ is reached with probability 1. □ 

Remark 10. Note that Lemma 9 only requires that the strategy $\sigma_i$ is totally mixed on one "side" of the Purgatory 
Duel. For the purpose of this section, we do not use that it only requires one side to be totally mixed, since we only use 
the result for optimal strategies for player 2, which are totally mixed by Lemma 8. However, the lemma will be reused 
in the next section, where the one-sidedness property will be useful. 

The following definition basically "mirrors" a strategy $\sigma_i$ for player $i$, for each $i$, and gives it to the other player. 
We show (in Lemma 12) that if $\sigma_2$ is optimal for player 2, then the mirror strategy is optimal for player 1. We also 
show that if CT 2 is an e-optimal strategy for player 2, for 0 < e < |, then so is the mirror strategy for player 1 (in 
Lemma 16). 

Definition 11 (Mirror strategy). Given a stationary strategy $\sigma_i$ for player $i$, for either $i$, let the mirror strategy $\sigma^{\sigma_i}_{3-i}$ for 
the other player $3 - i$ be the stationary strategy where $\sigma^{\sigma_i}_{3-i}(v^{3-i'}_j) = \sigma_i(v^{i'}_j)$ for each $i'$ and $j$. 

We next show that player 1 has optimal stationary strategies in the Purgatory Duel and give expressions for the 
values of states. 

Lemma 12. Let $\sigma_2$ be some optimal stationary strategy for player 2. Then the mirror strategy $\sigma^{\sigma_2}_1$ is optimal for 
player 1. We have $\mathrm{val}(v_s) = \frac{1}{2}$ and $\mathrm{val}(v^i_j) = 1 - \mathrm{val}(v^{3-i}_j)$, for all $i, j$. 

Proof. Consider some optimal stationary strategy $\sigma_2$ for player 2. It is thus totally mixed, by Lemma 8. Let $\sigma_1 = \sigma^{\sigma_2}_1$ 
be the mirror strategy for player 1. 

Playing $\sigma_1$ against $\sigma_2$ and starting in $v_s$ we see that we have probability $\frac{1}{2}$ to reach $\top$ and probability $\frac{1}{2}$ to reach 
$\bot$, by symmetry and Lemma 9. This shows that the value is at least $\frac{1}{2}$ because $\sigma_2$ is optimal. On the other hand, 
consider some stationary strategy $\sigma'_1$ for player 1, and the mirror strategy $\sigma'_2 = \sigma^{\sigma'_1}_2$ for player 2. If player 2 plays 
$\sigma'_2$ against $\sigma'_1$, then the probability to eventually reach $\bot$ is equal to the probability to eventually reach $\top$ and then 
there is some probability $p$ (perhaps 0) that neither will be reached. The payoff $u(v_s, \sigma'_1, \sigma'_2, 1)$ is then at most $\frac{1}{2}$. This 
shows that player 1 cannot ensure value strictly more than $\frac{1}{2}$, which is then the value of $v_s$. Finally, we argue that $\sigma_1$ 
is optimal. If not, then consider $\sigma'_2$ such that $u(v_s, \sigma_1, \sigma'_2, 1) < 1/2$, and then the mirror strategy $\sigma'_1 = \sigma^{\sigma'_2}_1$ ensures 
that $u(v_s, \sigma'_1, \sigma_2, 1) > 1/2$, contradicting optimality of $\sigma_2$. 

Similarly, for any $i, j$, playing $\sigma_1$ against $\sigma_2$ and starting in $v^i_j$ we see that the probability with which we reach $\top$ 
is equal to the probability of reaching $\bot$ starting in $v^{3-i}_j$ and vice versa, by symmetry. Also, by Lemma 9 the probability 
to eventually reach either $\bot$ or $\top$ is 1. Observe that the probability to reach $\bot$ starting in $v^{3-i}_j$ is at least $1 - \mathrm{val}(v^{3-i}_j)$, by 
optimality of $\sigma_2$ and that with probability 1 either $\bot$ is reached or $\top$ is reached. Also, again because $\sigma_2$ is optimal, the 
probability to reach $\top$ starting in $v^i_j$ is at most $\mathrm{val}(v^i_j)$. This shows that $\mathrm{val}(v^i_j) \geq 1 - \mathrm{val}(v^{3-i}_j)$. Using an argument like 
the one above, we obtain that $\mathrm{val}(v^i_j) = 1 - \mathrm{val}(v^{3-i}_j)$ and that $\sigma_1$ is optimal if the play starts in $v^i_j$. □ 

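Finally, we give an approximation of the values of states in the Purgatory Duel and a doubly-exponential lower bound on the patience 
of any optimal strategy. 

Theorem 13. For each $j$ in $\{1, \ldots, n\}$, the value of state $v^1_j$ in the Purgatory Duel is less than $\frac{1}{2} + 2^{(1-m) \cdot m^{n-j} - 1}$ 
and for any optimal stationary strategy $\sigma_i$ for either player $i$, the patience of $\sigma_i(v^{i'}_j)$ is at least $2^{(m-1)^2 \cdot m^{n-j-1}}$. 

Proof. Consider some optimal stationary strategy $\sigma_2$ for player 2. We will show using induction in $j$ that $\mathrm{val}(v^1_j)$ is 
less than $\frac{1}{2} + 2^{(1-m) \cdot m^{n-j} - 1}$ and that the patience of $\sigma_2(v^1_j)$ is at least $2^{(m-1)^2 \cdot m^{n-j-1}}$. Note that using Lemma 12 a 
similar result holds for optimal strategies for player 1. Let $v = v^1_j$. 

Base case, $j = n$: We see that the matrix $A^v$ is $M^{0,1,\frac{1}{2},m}$ and thus, by Lemma 5 (Property 4) we have that 
the value 

$$\begin{aligned} \mathrm{val}(v) &= \mathrm{val}(A^v) \\ &= \tfrac{1}{2} + \frac{1}{2^{m+1} - 2} \\ &< \tfrac{1}{2} + 2^{-m} \\ &= \tfrac{1}{2} + 2^{(1-m) \cdot m^{n-j} - 1} \end{aligned}$$

and $\sigma_2(v)$ has patience $2^m - 1 \geq 2^{(m-1)^2 \cdot m^{-1}}$. 

Induction case, $j \leq n - 1$: We see that the matrix $A^v$ is $M = M^{0,\mathrm{val}(v^1_{j+1}),\frac{1}{2},m}$. By induction we have that 
$\mathrm{val}(v^1_{j+1}) < \frac{1}{2} + 2^{(1-m) \cdot m^{n-j-1} - 1}$. Let $\varepsilon = 2^{(1-m) \cdot m^{n-j-1} - 1}$ and consider $M' = M^{0,\frac{1}{2}+\varepsilon,\frac{1}{2},m}$. By Lemma 5 
(Property 1 and 2) we get that $\mathrm{val}(M') > \mathrm{val}(M)$ and that the patience of optimal strategies in $M'$ is smaller than in $M$. Also, we 
get that 

$$\begin{aligned} \mathrm{val}(M') &\leq \tfrac{1}{2} + \varepsilon \cdot (2\varepsilon)^{m-1} \\ &= \tfrac{1}{2} + 2^{(1-m) \cdot m^{n-j-1} - 1} \cdot 2^{(1-m) \cdot m^{n-j-1} \cdot (m-1)} \\ &= \tfrac{1}{2} + 2^{(1-m) \cdot m^{n-j} - 1} \end{aligned}$$

and that the patience of optimal strategies in $M'$ (and thus in $M$) is at least 

$$(2\varepsilon)^{-m+1} = 2^{(1-m) \cdot m^{n-j-1} \cdot (-m+1)} = 2^{(m-1)^2 \cdot m^{n-j-1}} .$$

This completes the proof. □ 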
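The induction in the proof can be followed with exact arithmetic. The Python sketch below is a hedged illustration, not the authors' procedure: the function name `side_one_values_and_patience` is hypothetical, and the code assumes the facts established above, namely $\mathrm{val}(v_s) = \frac{1}{2}$ (Lemma 12), that the matrix game at $v^1_j$ has 0 below the diagonal, $\mathrm{val}(v^1_{j+1})$ on it and $\frac{1}{2}$ above it (Lemma 8), and the closed form for $\sigma_1(m)$ from the proof of Lemma 5. Starting from $v^1_{n+1} = \top$ it computes, for $j = n$ down to 1, the value of $v^1_j$ and the patience of an optimal strategy of the local matrix game.

```python
from fractions import Fraction

def side_one_values_and_patience(n, m):
    """Map j -> (val(v^1_j), patience of optimal play in the matrix game at v^1_j)."""
    half = Fraction(1, 2)
    nxt = Fraction(1)                      # value of v^1_{n+1}, the target state
    result = {}
    for j in range(n, 0, -1):
        eps = nxt - half                   # diagonal entry is 1/2 + eps
        ratio = (half + eps) / eps         # sigma_1(a) = ratio * sigma_1(a + 1)
        inv_sigma_m = sum(ratio ** k for k in range(m))   # equals 1 / sigma_1(m)
        value = half + eps / inv_sigma_m   # val = eps * sigma_1(m) + 1/2
        result[j] = (value, inv_sigma_m)
        nxt = value
    return result

for j, (value, patience) in sorted(side_one_values_and_patience(3, 3).items()):
    print(j, value, patience)
```

Already for $n = m = 3$ the patience grows from 7 at $v^1_3$ to 73 at $v^1_2$ and 262657 at $v^1_1$, while the values approach $\frac{1}{2}$ from above; each step down roughly raises the previous patience to the power $m - 1$, which is the mechanism behind the doubly-exponential bound.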

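Remark 14. It can be seen using induction that the value of each state in the Purgatory Duel is a rational number. 
First notice that $\mathrm{val}(v^1_n)$ and $\mathrm{val}(v^2_n)$ are each the value of a matrix game with rational entries (the entries are among 0, 1 and $\mathrm{val}(v_s) = \frac{1}{2}$) and hence are rational. Similarly, 
using induction in $j$, we see that for $i \in \{1, 2\}$ the number $\mathrm{val}(v^i_j)$ is rational, since it is the value of a matrix game with 
entries in $\{\mathrm{val}(v^i_0), \mathrm{val}(v^i_{j+1}), \mathrm{val}(v_s)\}$ (recall that $\mathrm{val}(v^1_0) = 0$ and $\mathrm{val}(v^2_0) = 1$). 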
3.3 The patience of ε-optimal strategies 

In this section we consider the patience of e-optimal strategies for 0 < £ < i. First we argue that each such strategy 
for player 2 is totally mixed on one side. 

Lemma 15. For all 0 < e < each e-optimal stationary strategy a 2 for player! is such that a 2 {vj) is totally mixed, 
for all j. 

Proof. Fix 0 < £ < i and fix some stationary strategy a 2 such that there exists j such that a 2 {v’j) is not totally mixed. 
We will show that a 2 is not £-optimal. 

Let $\eta$ be such that $0 < \eta < \frac{1}{2} - \varepsilon$. Let $a$ be an action such that $\sigma_2(v^2_j)(a) = 0$. Let $\sigma^P_1$ be an η-optimal 
strategy in Purgatory (not the Purgatory Duel) (with the same parameters $n$ and $m$). Let $\sigma_1$ be the strategy such that 
(i) $\sigma_1(v^2_{j'})(1) = 1$ for each $j' \neq j$; and (ii) $\sigma_1(v^2_j)(a) = 1$; and (iii) $\sigma_1(v^1_{j'}) = \sigma^P_1(v_{j'})$ for each $j'$. Consider a play starting in $v_s$. 
Whenever the play is in state $v^2_{j'}$, for some $j' \neq j$, in each step there is a probability of either going back to $v_s$ or going 
to $v^2_{j'+1}$. Thus, the play either reaches $v^2_j$ or has gone back to $v_s$. If it reaches $v^2_j$, then the next state is either $v_s$ or 
$\top$ (i.e., $\bot$ cannot be reached). If the play is in $v^1_1$, then there is a positive probability to reach $\top$ before going back 
to $v_s$, which is at least $\frac{1-\eta}{\eta}$ times the probability to reach $\bot$ before going back to $v_s$, since $\sigma_1$ follows an η-optimal 
strategy in Purgatory. Hence, the probability to eventually reach $\top$ is at least $1 - \eta > \frac{1}{2} + \varepsilon$ and thus $\sigma_2$ is not 
ε-optimal, since the value of $v_s$ is $\frac{1}{2}$ by Lemma 12. □ 


We now show that if we mirror an ε-optimal strategy, then we get an ε-optimal strategy. 

Lemma 16. For all 0 < e < ^, each e-optimal stationary strategy (T 2 faf player 2 in the Purgatory Duel, is such that 
the mirror strategy is e-optimal for player 1. 


Proof Fix 0 < £ < ^ and let a 2 be some £-optimal stationary strategy for player 2. Also, let cti = be the mirror 
strategy. 

By Lemma 15 the strategy $\sigma_2$ is such that $\sigma_2(v^2_j)$ is totally mixed, for all $j$. We can then apply Lemma 9 and get 
that either $\top$ or $\bot$ is reached with probability 1. Hence, since $\sigma_2$ is ε-optimal we reach $\bot$ with probability at least 
$1 - \mathrm{val}(v) - \varepsilon$ starting in $v$ against all strategies for player 1, for each $v$. It is clear that any play $P$ of $\sigma_2$ against any 
given strategy $\sigma'_1$ for player 1 starting in $v$ corresponds, by symmetry, to a play $P'$ of the mirror strategy $\sigma^{\sigma'_1}_2$ against $\sigma_1$ starting in $f(v)$, 
where 

$$f(v) = \begin{cases} v_s & \text{if } v = v_s \\ v^{3-i}_j & \text{if } v = v^i_j \\ \bot & \text{if } v = \top \\ \top & \text{if } v = \bot , \end{cases}$$

such that in round $i$ we have that $P_i = f(P'_i)$ and the plays are equally likely. Thus, the probability to reach $f(\bot) = \top$, 
starting in state $f(v)$, for each $v$ is at least $1 - \mathrm{val}(v) - \varepsilon = \mathrm{val}(f(v)) - \varepsilon$, where the equality follows from Lemma 12. 
Hence, $\sigma_1$ is ε-optimal for player 1. □ 


Next we give a definition and a lemma, which is similar to Lemma 6 in [25]. The purpose of the lemma is to 
identify certain cases where one can change the transition function of an MDP in a specific way and obtain a new 
MDP with larger values. We cannot simply obtain the result from Lemma 6 in [25], since the direction is opposite 
(i.e., Lemma 6 in [25] considers some cases where one can change the transition function and obtain a new MDP with 
smaller values) and our lemma is also for a slightly more general class of MDPs. 

Definition 17. Let $G$ be an MDP with safety objectives. A replacement set is a set of triples of states, actions and 
distributions over the states $Q = \{(s_1, a_1, \delta_1), \ldots, (s_\ell, a_\ell, \delta_\ell)\}$. Given the replacement set $Q$, the MDP $G[Q]$ is an 
MDP over the same states as $G$ and with the same set of safe states, but where the transition function $\delta'$ is 

$$\delta'(s, a) = \begin{cases} \delta_i & \text{if } s = s_i \text{ and } a = a_i \text{ for some } i \\ \delta(s, a) & \text{otherwise.} \end{cases}$$
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Lemma 18. Let G be an MDP with safety objectives. Consider some replacement set 

Q — 5 1 ) 

such that for all t and i we have that 

^((5(si,aO(s) • < ^(5i(s) • w‘) . 

seS ses 


Let v' be the value vector for G[Q\ with finite horizon t. (1) For all states s and time limits t we have that 


v\ < v' 


(2) For all states s, we have that 

val(G, s) < val(G[(5], s) . 

Proof. We first present the proof of the first item. We will show, using induction in $t$, that $v^t_s \le v'^t_s$ for all $s$. Let $\delta'$ be the
transition function for $G[Q]$.

Base case, $t = 0$: Consider some state $s$. Clearly we have that $v^0_s = v'^0_s$ because we have not changed the safe
states.

Induction case, $t \ge 1$: The induction hypothesis states that $v^{t-1}_s \le v'^{t-1}_s$ for all $s$. Consider some state $s$. Consider
any action $a'$ such that there is an $i$ such that $s = s_i$ and $a' = a_i$. We have that

$$\sum_{s'} \delta(s, a')(s') \cdot v^{t-1}_{s'} \le \sum_{s'} \delta'(s, a')(s') \cdot v^{t-1}_{s'}$$

by definition for such $a'$ (the statement is true for all time limits and thus also for $t - 1$). For all other actions $a''$ we
have that

$$\sum_{s'} \delta(s, a'')(s') \cdot v^{t-1}_{s'} = \sum_{s'} \delta'(s, a'')(s') \cdot v^{t-1}_{s'}$$

since $\delta(s, a'') = \delta'(s, a'')$. Hence, for every action $a$,

$$\sum_{s'} \delta(s, a)(s') \cdot v^{t-1}_{s'} \le \sum_{s'} \delta'(s, a)(s') \cdot v^{t-1}_{s'} .$$

We then have, using the recursive definition of $v^t_s$, that

$$v^t_s = \min_a \sum_{s'} \delta(s, a)(s') \cdot v^{t-1}_{s'} \le \min_a \sum_{s'} \delta'(s, a)(s') \cdot v^{t-1}_{s'} \le \min_a \sum_{s'} \delta'(s, a)(s') \cdot v'^{t-1}_{s'} = v'^t_s ,$$

where we just argued the first inequality; and the second inequality comes from the induction hypothesis and that each
factor is non-negative. (Note that the optimal strategy for player 2 in a matrix game $A^s$ of 1 row is to pick one of the
columns with the smallest entry with probability 1 and thus $v^t_s = \mathrm{val}(A^s[v^{t-1}]) = \min_a \sum_{s'} \delta(s, a)(s') \cdot v^{t-1}_{s'}$, and
similarly for $v'^t_s$.) This completes the proof of the first item. The second item follows from the first item and since
the value of a time-limited game goes to the value of the game without the time limit as the time limit grows to $\infty$, as
shown by [17]. □
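
The monotonicity in Lemma 18 can be checked numerically on a toy MDP. The following is a minimal sketch, not taken from the paper: the three-state MDP, the state names and the helper names `finite_horizon_values` and `replace` are hypothetical, and we assume that $v^0$ is 1 on the target state and 0 elsewhere, so that $v^t_s$ is the minimal probability of reaching the target within $t$ steps. The replacement set moves the probability mass on `bot` to `v`, in the same spirit as the use of the lemma in the proof of Lemma 19.

```python
from fractions import Fraction as F

def finite_horizon_values(delta, actions, target, horizon):
    """Return [v^0, ..., v^horizon] for the minimising (safety) player.

    Assumption: v^0 is 1 on target states and 0 elsewhere.
    """
    v = {s: F(1) if s in target else F(0) for s in actions}
    history = [v]
    for _ in range(horizon):
        # One step of the recursion v^t_s = min_a sum_s' delta(s,a)(s') * v^{t-1}_s'
        v = {
            s: min(sum(p * v[s2] for s2, p in delta[(s, a)].items())
                   for a in actions[s])
            for s in actions
        }
        history.append(v)
    return history

def replace(delta, Q):
    """Build the transition function of G[Q] from that of G."""
    new_delta = dict(delta)
    for s, a, dist in Q:
        new_delta[(s, a)] = dist
    return new_delta

# Hypothetical MDP: one non-absorbing state "v", absorbing "top" and "bot".
actions = {"v": [1, 2], "top": [0], "bot": [0]}
delta = {
    ("v", 1): {"top": F(1, 4), "bot": F(1, 4), "v": F(1, 2)},
    ("v", 2): {"top": F(1, 2), "bot": F(1, 2)},
    ("top", 0): {"top": F(1)},
    ("bot", 0): {"bot": F(1)},
}
# Replacement set: the mass on "bot" is moved to "v" (allowed since v^t_bot = 0).
Q = [("v", 1, {"top": F(1, 4), "v": F(3, 4)}),
     ("v", 2, {"top": F(1, 2), "v": F(1, 2)})]

before = finite_horizon_values(delta, actions, {"top"}, 10)
after = finite_horizon_values(replace(delta, Q), actions, {"top"}, 10)
```

Exact rational arithmetic is used so that the comparison $v^t_s \le v'^t_s$ is not blurred by floating-point error.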

We next show that for player 1, the patience of $\epsilon$-optimal strategies is high.
Lemma 19. For all $0 < \epsilon < \frac13$, each $\epsilon$-optimal stationary strategy $\sigma_1$ for player 1 in the Purgatory Duel has patience
at least $2^{m^{\Omega(n)}}$. For $N = 5$ the patience is $2^{\Omega(m)}$.

Proof. Consider some $\epsilon$-optimal stationary strategy $\sigma_1$ for player 1 in the Purgatory Duel. Fixing $\sigma_1$ for player 1 in the
Purgatory Duel we obtain an MDP $G'$ for player 2. Let $v^t$ be the value vector for $G'$ with finite horizon (time limit) $t$
and let $\delta$ be the transition function for $G'$. For each $i$, let

$$\delta_i(s) = \begin{cases} \delta(v^2_n, i)(s) & \text{if } v_s \ne s \ne \bot \\ \delta(v^2_n, i)(\bot) + \delta(v^2_n, i)(v_s) & \text{if } v_s = s \\ 0 & \text{if } \bot = s . \end{cases}$$

(Note that $\delta_i$ is the same probability distribution as $\delta(v^2_n, i)$, except that the probability mass on $\bot$ is moved to $v_s$.)
Consider the replacement set $Q = \{(v^2_n, 1, \delta_1), \ldots, (v^2_n, m, \delta_m)\}$ and the MDP $G'[Q]$. We have for all $t$ and $i$ that

$$\sum_{s \in S} \bigl(\delta(v^2_n, i)(s) \cdot v^t_s\bigr) \le \sum_{s \in S} \bigl(\delta_i(s) \cdot v^t_s\bigr) ,$$

because

$$v^t_\bot = 0 \le v^t_{v_s}$$

for all $t$ and the only difference between $\delta(v^2_n, i)$ and $\delta_i$ is that the probability mass on $\bot$ is moved to $v_s$. We then get
from Lemma 18(2) that $\mathrm{val}(G', v_s) \le \mathrm{val}(G'[Q], v_s)$. Let $\sigma_2$ be an optimal positional strategy in $G'[Q]$. It is easy to
see that $\sigma_2$ plays action 1 in $v^2_j$ for all $j$, because the best player 2 can hope for is to get back to $v_s$, since $\bot$ cannot be
reached from $v^2_j$ in $G'[Q]$ for any $j$, and if he plays some action which is not 1, then there is a positive probability that
$\top$ will be reached in one step. Thus, the MDP $G'[Q]$ corresponds to the MDP one gets by fixing the strategy $\sigma'_1$, where
$\sigma'_1(v_i) = \sigma_1(v^1_i)$, for player 1 in Purgatory. But the probability to reach $\top$ in $G'[Q]$ is at least $\frac12 - \epsilon$ and hence $\sigma'_1$ is
$(\frac12 + \epsilon)$-optimal in Purgatory (note that this is Purgatory and not Purgatory Duel). As shown by [20], any such strategy
requires patience $2^{m^{\Omega(n)}}$. Thus, any $\epsilon$-optimal stationary strategy for player 1 in the Purgatory Duel requires patience
$2^{m^{\Omega(n)}}$.

It was shown by [20] that the patience of $\epsilon$-optimal strategies for Purgatory with $n = 1$ Purgatory state is $2^{\Omega(m)}$,
and thus similarly for the Purgatory Duel with $N = 5$. □

We are now ready to prove the main theorem of this section. 

Theorem 20. For all $0 < \epsilon < \frac13$, every $\epsilon$-optimal stationary strategy, for either player, in the Purgatory Duel (that has
$N = 2n + 3$ states and at most $m$ actions for each player at all states) has patience $2^{m^{\Omega(n)}}$. For $N = 5$ the patience
is $2^{\Omega(m)}$.


Proof. The statement for strategies for player 1 follows from Lemma 19. By Lemma 16, for each $\epsilon$-optimal strategy
for player 2, there is an $\epsilon$-optimal strategy for player 1 (i.e., the mirror strategy) with the same patience. Thus the
result follows for strategies for player 2. □

4 Zero-sum Concurrent Stochastic Games: Patience Lower Bound for 
Three States 

In this section we show that the patience of all $\epsilon$-optimal strategies, for all $0 < \epsilon < \frac13$, for both players in a concurrent
reachability game $G$ with three states of which two are absorbing, and where the non-absorbing state has $m$ actions for each
player, can be as large as $2^{\Omega(m)}$. The proof consists of two phases: first, we show the lower bound in a game with at
most $m^2$ actions for each player; and second, we show that all but $2m - 1$ actions can be removed for both players in
the game without changing the patience.

The first game, the 3-state Purgatory Duel, is intuitively speaking the Purgatory Duel for $N = 5$, where we replace
the states $v^1_1$, $v^2_1$ and $v_s$ with a state $v'_s$ while in essence keeping the same set of $\epsilon$-optimal strategies. The idea is to
ensure that one step in the 3-state Purgatory Duel corresponds to two steps in the Purgatory Duel with $N = 5$, by
having the players pick all the actions they might use in the next two steps at once. The game is formally defined as
follows:

The 3-state Purgatory Duel consists of $N = 3$ states, named $v'_s$, $\top'$ and $\bot'$ respectively. The states $\top'$ and $\bot'$ are
absorbing. The state $v'_s$ is such that

$$A^1_{v'_s} = A^2_{v'_s} = \{(i, j) \mid 1 \le i, j \le m\} .$$

Also, let $\delta'$ be the transition function for the Purgatory Duel with $N = 5$. Let $p$ be the function that given a state in
$\{v_s, \bot, \top\}$ in the Purgatory Duel for $n = 1$ outputs the primed state (which is then a state in the 3-state Purgatory Duel).
Recall that $U(s, s')$ is the uniform distribution over $s$ and $s'$. Observe that the deterministic distributions $\delta'(v^1_1, a_1, a_2)$
and $\delta'(v^2_1, a_1, a_2)$ are in $\{v_s, \top, \bot\}$ for all $a_1$ and $a_2$. For each pair of actions $(a^1_1, a^2_1) \in A^1_{v'_s}$ and $(a^1_2, a^2_2) \in A^2_{v'_s}$ in
the 3-state Purgatory Duel, we have that

$$\delta\bigl(v'_s, (a^1_1, a^2_1), (a^1_2, a^2_2)\bigr) = U\bigl(p(\delta'(v^1_1, a^1_1, a^1_2)),\ p(\delta'(v^2_1, a^2_1, a^2_2))\bigr) .$$

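To make the game easier to understand on its own, we now give a more elaborate description of the transition function
$\delta$ without using the transition function for the Purgatory Duel. To make the pattern as clear as possible we write $U(s, s)$
instead of $s$ for all $s$.

$$\delta\bigl(v'_s, (a^1_1, a^2_1), (a^1_2, a^2_2)\bigr) = \begin{cases}
U(\bot', \top') & \text{if } a^1_1 > a^1_2 \text{ and } a^2_1 > a^2_2 \\
U(\bot', \bot') & \text{if } a^1_1 > a^1_2 \text{ and } a^2_1 = a^2_2 \\
U(\bot', v'_s) & \text{if } a^1_1 > a^1_2 \text{ and } a^2_1 < a^2_2 \\
U(\top', \top') & \text{if } a^1_1 = a^1_2 \text{ and } a^2_1 > a^2_2 \\
U(\top', \bot') & \text{if } a^1_1 = a^1_2 \text{ and } a^2_1 = a^2_2 \\
U(\top', v'_s) & \text{if } a^1_1 = a^1_2 \text{ and } a^2_1 < a^2_2 \\
U(v'_s, \top') & \text{if } a^1_1 < a^1_2 \text{ and } a^2_1 > a^2_2 \\
U(v'_s, \bot') & \text{if } a^1_1 < a^1_2 \text{ and } a^2_1 = a^2_2 \\
U(v'_s, v'_s) & \text{if } a^1_1 < a^1_2 \text{ and } a^2_1 < a^2_2
\end{cases}$$
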

Furthermore, the target set is $\{\top'\}$. We will use $\tau$ for strategies in the 3-state Purgatory Duel to distinguish them from strategies
in the Purgatory Duel. There is an illustration of the Purgatory Duel with $N = 5$ and $m = 2$ in Figure 3 and the
corresponding 3-state Purgatory Duel in Figure 4.
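
As an illustration of the case distinction defining the transition function of the 3-state Purgatory Duel, the following sketch computes the successor distribution for a pair of actions. It is only a sketch under our reading of the definition: an action is a pair (first component, second component), the first component is resolved as in player 1's half of the Purgatory Duel and the second as in player 2's half, and each component determines the next state with probability $\frac12$. The state labels and function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical labels for the three states v'_s, top' and bot'.
V, TOP, BOT = "v_s'", "top'", "bot'"

def first_component(a1, a2):
    """Outcome of the first components (player 1's action a1 vs player 2's a2)."""
    if a1 > a2:
        return BOT
    if a1 == a2:
        return TOP
    return V

def second_component(a1, a2):
    """Outcome of the second components (player 1's action a1 vs player 2's a2)."""
    if a1 > a2:
        return TOP
    if a1 == a2:
        return BOT
    return V

def transition(action1, action2):
    """Successor distribution in v'_s: uniform over the two component outcomes."""
    (a11, a12), (a21, a22) = action1, action2
    outcomes = (first_component(a11, a21), second_component(a12, a22))
    dist = {}
    for state in outcomes:
        dist[state] = dist.get(state, Fraction(0)) + Fraction(1, 2)
    return dist
```

For instance, `transition((1, 1), (2, 2))` puts all probability on $v'_s$, which matches the case $U(v'_s, v'_s)$.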

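Given a strategy $\tau_i$ for player $i$ in the 3-state Purgatory Duel we define the strategy $\sigma_i^{\tau_i}$ in the Purgatory Duel with
$N = 5$ which is the projection of $\tau_i$, and vice versa (note that the other direction maps to a set of strategies).

Definition 21. Given a strategy $\tau_i$ for player $i$ in the 3-state Purgatory Duel, let $\sigma_i^{\tau_i}$ be the stationary strategy for
player $i$ in the Purgatory Duel with $N = 5$ where

$$\sigma_i^{\tau_i}(v^1_1)(a^1_i) = \sum_{a^2_i} \tau_i(v'_s)\bigl((a^1_i, a^2_i)\bigr)$$

and

$$\sigma_i^{\tau_i}(v^2_1)(a^2_i) = \sum_{a^1_i} \tau_i(v'_s)\bigl((a^1_i, a^2_i)\bigr) .$$

Also, for any stationary strategy $\sigma_i$ in the Purgatory Duel with $N = 5$, let $T^{\sigma_i}$ be the set of stationary strategies in the
3-state Purgatory Duel such that $\tau_i \in T^{\sigma_i}$ implies that $\sigma_i^{\tau_i} = \sigma_i$.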

Lemma 22. Consider any $\epsilon > 0$. Let $G$ be the Purgatory Duel with $N = 5$ and $G'$ be the 3-state Purgatory Duel. For
any $\epsilon$-optimal stationary strategy $\tau_i$ for player $i$ in $G'$, we have that $\sigma_i^{\tau_i}$ is $\epsilon$-optimal starting in $v_s$ in $G$. Similarly, for
any $\epsilon$-optimal stationary strategy $\sigma_i$ in $G$ starting in $v_s$, each strategy in $T^{\sigma_i}$ is $\epsilon$-optimal in $G'$. Also, $\mathrm{val}(v'_s) = \frac12$.
Figure 3: An illustration of the Purgatory Duel with $N = 5$ and $m = 2$. The two dashed edges have probability $\frac12$ each.

Figure 4: An illustration of the 3-state Purgatory Duel with $m = 2$. The non-dashed edges have probability $\frac12$ each. The
order of the actions is (1,1), (1,2), (2,1), (2,2). The actions (i.e., (2,2) for player 1 and (1,1) for player 2) with
white background cannot be played in a restricted strategy.
Proof. Consider some pair of strategies $\tau_i$ and $\sigma_i^{\tau_i}$ for player $i$ in $G'$ and $G$, respectively. Fixing $\tau_i$ and $\sigma_i^{\tau_i}$ as the
strategy for player $i$ we get two MDPs $H'$ and $H$, respectively. We will argue that $\mathrm{val}(H', v'_s) = \mathrm{val}(H, v_s)$. Let $v'^t$
and $v^t$ be the vectors of values for the value iteration algorithm in iteration $t$ when run on $H'$ and $H$ respectively (i.e.,
the values of $H'$ and $H$ with time limit $t$). We have that $v'^t_{v'_s} = v^{2t}_{v_s}$, by definition of the value-iteration algorithm and
the transition function in the 3-state Purgatory Duel. Hence, since $v^t$ and $v'^t$ converge to the value of state $v_s$ and
$v'_s$ in $H$ and $H'$ respectively, they have the same value. We know that the value of $v_s$ is $\frac12$ and thus that is also the
value of $v'_s$. □

Corollary 23. The patience of $\epsilon$-optimal stationary strategies for both players, for $0 < \epsilon < \frac13$, in the 3-state Purgatory
Duel is at least $2^{\Omega(m)}$, where $m^2$ is the number of actions in state $v'_s$.

Proof. The patience of $\epsilon$-optimal strategies, for $0 < \epsilon < \frac13$, in the Purgatory Duel with $N = 5$ is $2^{\Omega(m)}$ by
Theorem 20. Thus, by Lemma 22, the patience of the 3-state Purgatory Duel is $2^{\Omega(m)}$. □


The restricted 3-state Purgatory Duel. The above corollary only shows that for the 3-state Purgatory Duel, in
which one state has $m^2$ actions and the others have 1, the patience is at least $2^{\Omega(m)}$. We now show how to decrease the
number of actions from quadratic down to linear, while keeping the same patience.

From Lemma 15 and Lemma 16 we see that for any optimal strategy $\sigma_1$ for player 1 (resp., $\sigma_2$ for player 2) in
the Purgatory Duel with $N = 5$, we have that $\sigma_1(v^1_1)(1) > \frac12$ and that $\sigma_1(v^2_1)(1) > \frac12$ (resp., $\sigma_2(v^1_1)(m) > \frac12$
and that $\sigma_2(v^2_1)(m) > \frac12$). Hence, there exists an optimal strategy for player 1 in the 3-state Purgatory Duel that
only plays actions of the form $(1, a^2_1)$ and $(a^1_1, 1)$ with positive probability. More precisely, the strategy $\tau_1$ where
(1) $\tau_1(v'_s)((1, a^2_1)) = \sigma_1(v^2_1)(a^2_1)$; and (2) $\tau_1(v'_s)((a^1_1, 1)) = \sigma_1(v^1_1)(a^1_1)$; and (3) has the remaining probability mass
on $(1, 1)$ is optimal in the 3-state Purgatory Duel, since $\sigma_1^{\tau_1}$ is $\sigma_1$. Similarly for player 2 and the actions $(m, a_2)$ and
$(a_2, m)$. Let

$$R_1 = \{(i, j) \mid i = 1 \vee j = 1,\ 1 \le i, j \le m\}$$

and

$$R_2 = \{(i, j) \mid i = m \vee j = m,\ 1 \le i, j \le m\} .$$

Observe that $|R_1| = |R_2| = 2m - 1$. We say that a strategy for player $i$, for each $i$, is restricted if the strategy uses only
actions in $R_i$. The sub-matrix corresponding to the restricted 3-state Purgatory Duel for $m = 2$ is depicted as the grey
sub-matrix in Figure 4. This suggests the definition of the restricted 3-state Purgatory Duel, which is like the 3-state
Purgatory Duel, except that the strategies for the players are restricted. We next show that $\epsilon$-optimal strategies in the
restricted 3-state Purgatory Duel also have high patience (note that while this is perhaps not surprising, it does not
follow directly from the similar result for the 3-state Purgatory Duel, since it is possible that the restriction removes the
optimal best reply to some strategy which would otherwise not be $\epsilon$-optimal). The key idea of the proof is as follows:
(i) we show that the patience of player $i$ in the 3-state Purgatory Duel remains unchanged even if only the opponent is
forced to use restricted strategies; and (ii) each player has a restricted strategy that is optimal in the 3-state Purgatory
Duel as well as in the restricted 3-state Purgatory Duel.
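
The restricted action sets and the embedding of a Purgatory Duel strategy for player 1 into a restricted strategy can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, strategies are represented as dictionaries from actions to probabilities, and the embedding assumes, as in the text, that action 1 has probability greater than $\frac12$ in both $v^1_1$ and $v^2_1$ so that the remaining mass on $(1, 1)$ is non-negative.

```python
from fractions import Fraction

def restricted_actions(m):
    """Return (R_1, R_2) for the 3-state Purgatory Duel with parameter m."""
    pairs = [(i, j) for i in range(1, m + 1) for j in range(1, m + 1)]
    r1 = {(i, j) for (i, j) in pairs if i == 1 or j == 1}
    r2 = {(i, j) for (i, j) in pairs if i == m or j == m}
    return r1, r2

def embed_player1(sigma_v1, sigma_v2):
    """Restricted strategy for player 1 built from its distributions in v_1^1 and v_1^2."""
    tau = {}
    for a, prob in sigma_v1.items():
        if a != 1:
            tau[(a, 1)] = prob          # first component a, second component 1
    for a, prob in sigma_v2.items():
        if a != 1:
            tau[(1, a)] = prob          # first component 1, second component a
    tau[(1, 1)] = Fraction(1) - sum(tau.values())  # remaining probability mass
    return tau

r1, r2 = restricted_actions(2)
tau1 = embed_player1({1: Fraction(3, 4), 2: Fraction(1, 4)},
                     {1: Fraction(2, 3), 2: Fraction(1, 3)})
```

The marginal of `tau1` on each component recovers the original distributions, which is the sense in which the projection of the embedded strategy is the strategy one started with.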

Lemma 24. The value of state $v'_s$ in the restricted 3-state Purgatory Duel is $\frac12$.


Proof. Each player has a restricted strategy which is optimal in the 3-state Purgatory Duel and ensures value $\frac12$. Thus,
these strategies must still be optimal in the restricted 3-state Purgatory Duel and still ensure value $\frac12$. □

The next lemma is conceptually similar to Lemma 15 for $N = 5$ (however, it does not follow from Lemma 15,
since the strategies for player 1 are restricted here).

Lemma 25. Let $\tau_2$ be an $\epsilon$-optimal stationary strategy for player 2 in the restricted 3-state Purgatory Duel, for
$0 < \epsilon < \frac13$. Then, $\sum_{i=1}^{m} \tau_2(v'_s)((i, j)) > 0$, for each $j$.


Proof. Fix $0 < \epsilon < \frac13$. Let $\tau_2$ be a stationary strategy in the 3-state Purgatory Duel (note, we do not require that $\tau_2$ is
restricted), such that there exists an $a_2$ for which $\tau_2(v'_s)((a_1, a_2)) = 0$ for all $a_1$. Let $a'$ be the smallest such $a_2$.

Fix $0 < \eta < \frac12 - \epsilon$. We show that there exists a restricted stationary strategy $\tau_1$ for player 1, ensuring that the
payoff is at least $1 - \eta > \frac12 + \epsilon$. There are two cases. Either (i) $a' = 1$ or (ii) not.

In case (i), let $\sigma_1$ be an $\eta$-optimal strategy for player 1 in the Purgatory with parameters $(3, m)$. Then consider
the strategy $\tau_1$, where $\tau_1(v'_s)((a, 1)) = \sigma_1(v_1)(a)$, for each $a$. Observe that $\tau_1$ is a restricted strategy. Consider
what happens if $\tau_1$ is played against $\tau_2$: In each round $i$, as long as $v_i = v'_s$, the next state is either defined by the first
or the second component of the actions of the players. If it is defined by the second component, then the next state
$v_{i+1}$ is always $v'_s$, because player 1's second component is 1 and player 2's second component is greater than 1. Consider the
rounds where the next state is defined by the first component. In such rounds $\top$ is reached with probability $(1 - \eta) \cdot p$,
for some $p > 0$, and $\bot$ is reached with probability at most $\eta \cdot p$, because player 1 follows an $\eta$-optimal strategy in
Purgatory on the first component. But in expectation, in every second round the first component is used and thus $\top$ is
reached with probability at least $1 - \eta$, which shows that $\tau_2$ is not $\epsilon$-optimal.

In case (ii), consider the strategy $\tau_1$, such that $\tau_1(v'_s)((1, a')) = 1$. Observe that $\tau_1$ is a restricted strategy. Consider
what happens if $\tau_1$ is played against $\tau_2$: In each round $i$, as long as $v_i = v'_s$, the next state is either defined by the
first or the second component of the players' choice. If it is defined by the first component, then the next state $v_{i+1}$ is
always $v'_s$ or $\top$, because the choice of player 1 is 1. Consider the rounds where the next state is defined by the second
component. In each such round either $\top$ or $v'_s$ is reached and $\top$ is reached with positive probability, since player 1
plays $a' > 1$ and player 2 always plays something else and plays 1 with positive probability. But in expectation, in every
second round the second component is used and hence $\top$ is reached with probability 1 eventually, which shows that
$\tau_2$ is not $\epsilon$-optimal. □

We will now define how to mirror strategies in the restricted 3-state Purgatory Duel. 

Definition 26. Given a stationary strategy Ti for player i in the restricted 3-state Purgatory Duel, for either i, let tZ' 

be the stationary strategy for player i (referred to as the mirror strategy ofri) in the restricted 3-state Purgatory Duel 
where rf' (u')((ai, 02 )) = Ti(u')((a 2 , ai)) for each oi and 02 . 

We next show that each $\epsilon$-optimal stationary strategy for player 2 can be mirrored to an $\epsilon$-optimal stationary strategy for
player 1. The statement and the proof idea are similar to Lemma 16, but since the strategies for the players are
restricted here, there are some differences.

Lemma 27. For all $0 < \epsilon < \frac13$, each $\epsilon$-optimal stationary strategy $\tau_2$ for player 2 in the restricted 3-state Purgatory
Duel is such that the mirror strategy is $\epsilon$-optimal for player 1 in the restricted 3-state Purgatory Duel.

Proof Fix £, such that 0 < e < ^. Consider some e-optimal stationary strategy for player 2 in the restricted 

3-state Purgatory Duel. Let t* = be the mirror strategy for player 1 given and let T2 be an optimal best 
reply to rj'. Let ti = rp be the mirror strategy for player 1 given T 2 . Observe that eventually either T or _L is 
reached with probability 1, when playing t* against T2, by Lemma |25] and the construction of the game (since there 
is a positive probability that the second component matches in every round in which the play is in v'^). We have 
that u(z;', 71,72) < 5 -b e, since 7^ is e-optimal. This indicates that T' is reached with probability at most 5 -f e 
when playing 7i against 7 ^. Hence, by symmetry _L' is reached with probability at most ^ -b e when playing 7i 
against 72 . Thus, since _L' or T' is reached with probability 1, we have that u(u' , 7i, 72 ) > 5 — showing that 7 * is 
e-optimal. □ 

We next show that $\epsilon$-optimal stationary strategies for player 1 require high (exponential) patience. The statement
and the proof idea are similar to Lemma 19, but since the players' strategies are restricted here, there are some
differences.

Lemma 28. For all $0 < \epsilon < \frac13$, each $\epsilon$-optimal stationary strategy $\sigma_1$ for player 1 in the restricted 3-state Purgatory
Duel has patience $2^{\Omega(m)}$.

Proof. Fix some $0 < \epsilon < \frac13$ and some $\epsilon$-optimal stationary strategy $\sigma_1$ for player 1 in the restricted 3-state Purgatory
Duel. The restricted 3-state Purgatory Duel then turns into an MDP $M$ for player 2 and we can apply Lemma 18(2).
For an action $(a^1_2, a^2_2)$ of player 2, we have that $p = \sum_{a^1_1} \sigma_1(v'_s)((a^1_1, a^2_2))/2$ is the probability that player 1 plays an action with second component
$a^2_2$ and the next state is defined by the second component. Let $d(a^1_2, a^2_2)$ be the probability distribution over successors
if player 2 plays $(a^1_2, a^2_2)$ in $v'_s$. Observe that the play would go to $\bot$ if both players played $a^2_2$ and the next state is
defined by the second component and thus

$$d(a^1_2, a^2_2)(\bot) - p \ge 0 .$$




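Let

$$d'(a^1_2, a^2_2)(v) = \begin{cases} d(a^1_2, a^2_2)(v'_s) + p & \text{if } v = v'_s \\ d(a^1_2, a^2_2)(\bot) - p & \text{if } v = \bot \\ d(a^1_2, a^2_2)(\top) & \text{if } v = \top . \end{cases}$$

Consider the MDP $M'$, which is equal to $M$, except that it uses the distribution $d'(a^1_2, a^2_2)$ instead of $d(a^1_2, a^2_2)$. By
Lemma 18(2) we have that

$$\mathrm{val}(M') \ge \mathrm{val}(M) \ge \tfrac12 - \epsilon > \tfrac16 .$$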

It is clear that player 2 has an optimal positional strategy in $M'$ that plays $(a^1_2, m)$ for some $a^1_2$ (this strategy is
restricted), since playing $(a^1_2, a^2_2)$, for some $a^2_2 < m$, just increases the probability to reach $\top$ in one step (because
player 1 might play some action $a^2_1 > a^2_2$ and otherwise the play will go back to $v'_s$). But $M'$ corresponds to the
MDP obtained by playing $\sigma_1$ in the Purgatory with $N = 3$ (where $v'_s$ corresponds to $v_1$), except that with probability
$\frac12$ the play goes from $v'_s$ back to $v'_s$ in the restricted 3-state Purgatory Duel no matter the choice of the players. This
difference clearly does not change the value. Hence, $\sigma_1$ ensures payoff at least $\frac16$ in the Purgatory with $N = 3$ and
hence has patience $2^{\Omega(m)}$ by [20]. □

We are now ready for the main result of this section. 

Theorem 29. For all $0 < \epsilon < \frac13$, every $\epsilon$-optimal stationary strategy, for either player, in the restricted 3-state
Purgatory Duel (that has three states, two of which are absorbing, and where the non-absorbing state has $O(m)$ actions for
each player) has patience $2^{\Omega(m)}$.

Proof. By Lemma 28 the statement is true for every $\epsilon$-optimal stationary strategy for player 1. By Lemma 27 every
$\epsilon$-optimal stationary strategy for player 2 corresponds to an $\epsilon$-optimal stationary strategy for player 1, with the same
patience, and thus every $\epsilon$-optimal stationary strategy for player 2 has patience $2^{\Omega(m)}$. □

5 Zero-sum Concurrent Stochastic Games: Patience Upper Bound 

In this section we give upper bounds on the patience of optimal and $\epsilon$-optimal stationary strategies in a zero-sum
concurrent reachability game $G$ for the safety player. Our exposition here makes heavy use of the setup of Hansen et al.
[21] and will for that reason not be fully self-contained. We assume for concreteness that player 1 is the reachability
player and player 2 the safety player.

Hansen et al. showed ETl Corollary 42] for the more general class of Everett’s recursive games ifThl that each 
player has an e-optimal stationary strategy of doubly-exponential patience. More precisely, if all probabilities have 
bit-size at most r, then each player has an e-optimal strategy of patience bounded by \ For zero-sum 

concutTent reachability games the safety player is guaranteed to have an optimal stationary strategy EQlESl- Using 
this fact one may use directly the results of Hansen et al. to show that the safety player has an optimal strategy of 

patience bounded by \ We shall below refine this latter upper bound in terms of the number of value 

classes of the game. The overall approach in deriving this is the same, namely we use the general machinery of real
algebraic geometry and semi-algebraic geometry [3] to derive our bounds. In order to do this we derive a formula in
the first-order theory of the real numbers that uniquely defines the value of the game, and from the value of the game
we can express the optimal strategies. The improved bound is obtained by presenting a formula where the number of
variables depends only on the number of value classes rather than the number of states.

Let below $N$ denote the number of non-absorbing states, and $m \ge 2$ the maximum number of actions in a state for
either player. Assume that all probabilities are rational numbers with numerators and denominators of bit-size at most
$\tau$, where the bit-size of a positive integer $n$ is given by $\lfloor \lg n \rfloor + 1$. We let $K$ denote the number of value classes. We
number the non-absorbing states $1, \ldots, N$ and assume that both players have the actions $\{1, \ldots, m\}$ in each of these
states. For a non-negative integer $z$, define $\mathrm{bit}(z) = \lceil \lg z \rceil$.
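
The two size measures differ slightly, which the following small sketch makes concrete. The function names are ours; both functions are only defined here for positive integers, since $\lg 0$ is undefined.

```python
def bit_size(n):
    """Bit-size of a positive integer n, i.e. floor(lg n) + 1."""
    if n < 1:
        raise ValueError("bit_size is defined for positive integers")
    return n.bit_length()

def bit(z):
    """ceil(lg z) for a positive integer z."""
    if z < 1:
        raise ValueError("this sketch only handles positive integers")
    return (z - 1).bit_length()
```

For a power of two such as 8 the two measures are 4 and 3 respectively, while for 9 both are 4.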

Given valuations $v_1, \ldots, v_N$ for the non-absorbing states, we define for each state $k$ an $m \times m$ matrix game $A^k(v)$
letting entry $(i, j)$ be $b^k_{ij} + \sum_{\ell=1}^{N} p^{k\ell}_{ij} v_\ell$, where $p^{k\ell}_{ij} = \delta(k, i, j)(\ell)$ and $b^k_{ij}$ is the probability of a transition to a state
where the reachability player wins, given actions $i$ and $j$ in state $k$. The value mapping operator $M : \mathbb{R}^N \to \mathbb{R}^N$
is given by $M(v) = (\mathrm{val}(A^1(v)), \ldots, \mathrm{val}(A^N(v)))$. Everett showed that the value vector of his recursive games is
given by the unique critical vector, which in turn is defined using the value mapping. We will instead for concurrent
reachability games use the characterization of the value vector as the coordinate-wise least fixpoint of the value
mapping. The value vector $v$ is thus characterized by the formula

$$M(v) = v \wedge (\forall v' : M(v') = v' \Rightarrow v \le v') . \qquad (1)$$
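
To make the value mapping concrete, the following sketch iterates it from the zero vector on a toy game with a single non-absorbing state and two actions per player, where the target is reached if the actions match and the play stays put otherwise. The game, the function names and the use of the textbook closed form for $2 \times 2$ matrix games are our own choices for illustration and are not taken from the paper; the iterates increase towards the least fixpoint, which for this game is 1.

```python
from fractions import Fraction as F

def value_2x2(a, b, c, d):
    """Value of the matrix game [[a, b], [c, d]]; the row player maximises."""
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:          # a saddle point in pure strategies exists
        return maximin
    # Otherwise both players mix, and the classical closed form applies.
    return (a * d - b * c) / (a + d - b - c)

def value_mapping(v):
    """M(v) for the toy game: matching actions reach the target, else stay."""
    return value_2x2(F(1), v, v, F(1))

v = F(0)
iterates = [v]
for _ in range(4):
    v = value_mapping(v)
    iterates.append(v)
```

Here $M(v) = (1 + v)/2$, so the iterates are $1 - 2^{-t}$ and no finite iterate equals the value.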

Similarly to [21, proof of Theorem 13] we obtain the following statement.

Lemma 30. There is a quantifier-free formula with $N$ variables $v$ that expresses $M(v) = v$. The formula uses at
most $N(m + 2)4^m$ different polynomials, each of degree at most $m + 2$ and having coefficients of bit-size at most
$2(N + 1)(m + 2)^2 \mathrm{bit}(m)\tau$.

Now, if we instead introduce a variable for each value class, we can express $M(v) = v$ using only $K$ free variables,
by identifying variables of the same value class. For $w \in \mathbb{R}^K$, let $v(w) \in \mathbb{R}^N$ denote the vector obtained by letting
the coordinates corresponding to value class $j$ be assigned $w_j$. We thus simply express $M(v(w)) = v(w)$ instead.
Combining this with (1) we obtain the final formula.

Corollary 31. There is a quantified formula with $K$ free variables that describes whether the vector $v(w)$ is the value
vector of $G$. The formula has a single block of quantifiers over $K$ variables. Furthermore the formula uses at most
$2N(m + 2)4^m + K$ different polynomials, each of degree at most $m + 2$ and having coefficients of bit-size at most
$2(N + 1)(m + 2)^2 \mathrm{bit}(m)\tau$.

We shall now apply the quantifier elimination [3, Theorem 14.16] and sampling [3, Theorem 13.11] procedures to
the formula of Corollary 31.

First we use Theorem 14.16 of Basu, Pollack, and Roy [3], obtaining a quantifier-free formula with $K$ variables,
expressing that $v(w)$ is the value vector of $G$. Next we use Theorem 13.11 of [3] to obtain a univariate representation of $w$ such
that $v(w)$ is the value vector of $G$. That is, we obtain univariate real polynomials $f, g_0, \ldots, g_K$, where $f$ and $g_0$ are
coprime, such that $w = (g_1(t)/g_0(t), \ldots, g_K(t)/g_0(t))$, where $t$ is a root of $f$. These polynomials are of degree
$m^{O(K^2)}$ and their coefficients have bit-size $\tau m^{O(K^2)}$. Our next task is to recover from $w$ an optimal strategy for the safety
player. For this we just need to select optimal strategies for the column player in each of the matrix games $A^k(v(w))$.
Such optimal strategies correspond to basic feasible solutions of standard linear programs for computing the value and
optimal strategies of matrix games (cf. [21, Lemma 3]). This means that there exist $(m + 1) \times (m + 1)$ matrices
$M^1(w), \ldots, M^N(w)$, such that $(q^k_1(w), \ldots, q^k_m(w))$ is an optimal strategy for the column player in $A^k(v(w))$, where
$q^k_i(w) = \det((M^k(w))_i) / \det(M^k(w))$, and where $(M^k(w))_i$ denotes the matrix obtained from $M^k(w)$ by replacing
column $i$ with the $(m + 1)$th unit vector $e_{m+1}$. As the matrices $M^1(w), \ldots, M^N(w)$ are obtained from the matrix games
$A^1(v(w)), \ldots, A^N(v(w))$, the entries are degree 1 polynomials in $w$ having rational coefficients with numerators
and denominators of bit-size at most $\tau$ as well. Using a simple bound on determinants [3, Proposition 8.12], and
substituting the expression $g_j(t)/g_0(t)$ for $w_j$ for each $j$, we obtain a univariate representation of $(q^k_1(w), \ldots, q^k_m(w))$
for each $k$ given by polynomials of degree $m^{O(K^2)}$ whose coefficients have bit-size $\tau m^{O(K^2)}$. Substituting the root
$t$ using resultants (cf. [21, Lemma 15]) we finally obtain the following result.

Theorem 32. Let $G$ be a zero-sum concurrent reachability game with $N$ non-absorbing states, at most $m \ge 2$ actions
for each player in every non-absorbing state, and where all probabilities are rational numbers with numerators and
denominators of bit-size at most $\tau$. Assume further that $G$ has at most $K$ value classes. Then there is an optimal
strategy for the safety player where each probability is a real algebraic number, defined by a polynomial of degree
$m^{O(K^2)}$ and maximum coefficient bit-size $\tau m^{O(K^2)}$.

By standard root separation bounds (e.g. [38, Chapter 6, equation (5)]) we obtain a patience upper bound.

Corollary 33. Let $G$ be as in Theorem 32. Then there is an optimal strategy for the safety player of patience at most
$2^{\tau m^{O(K^2)}}$.
In general the probabilities of this optimal strategy will be irrational numbers. However we may employ the
rounding scheme as explained in Lemma 14 and Theorem 15 of Hansen, Koucky, and Miltersen [22] to obtain a
rational $\epsilon$-optimal strategy. Letting $\epsilon = 2^{-\ell}$ we may round each probability, except the largest, upwards to $L$ =
Ig i -f Ig Ig i -f ^ binary digits, and then rounding the largest probability down by the total amount the rest 
were rounded up. Here we use that by fixing the above strategy of patience at most $2^{\tau m^{O(K^2)}}$ for the safety player and
a pure strategy for the reachability player one obtains a Markov chain where each non-zero transition probability is
at least $2^{-\tau m^{O(K^2)}}$. We thus have the following.

Corollary 34. Let $G$ be as in Theorem 32. Then there is an $\epsilon$-optimal strategy for the safety player where each
probability is a rational number with a common denominator of magnitude at most ^ Ig '. 
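
The rounding step described above (round every probability except the largest up to $L$ binary digits, then lower the largest by the total amount added) can be sketched as follows. This is only a sketch of that single step under our own assumptions: the function name is hypothetical, the number of digits $L$ is taken as an input rather than derived from $\epsilon$, and we assume the largest probability is big enough to absorb the total rounding error.

```python
from fractions import Fraction
import math

def round_strategy(probs, L):
    """Round all but the largest probability up to L binary digits.

    The largest probability is then lowered so that the result sums to 1.
    Zero probabilities stay zero, so the support is preserved.
    """
    scale = 2 ** L
    largest = max(range(len(probs)), key=lambda i: probs[i])
    rounded = [None] * len(probs)
    total_rest = Fraction(0)
    for i, p in enumerate(probs):
        if i != largest:
            rounded[i] = Fraction(math.ceil(p * scale), scale)
            total_rest += rounded[i]
    rounded[largest] = Fraction(1) - total_rest
    return rounded

strategy = [Fraction(1, 3), Fraction(1, 7), Fraction(11, 21)]
rounded = round_strategy(strategy, 4)
```

After rounding, every probability is a multiple of $2^{-L}$, so all probabilities share the common denominator $2^L$.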

We now address the basic decision problem. Let s be a state and let A be a rational number with numerator and 
denominator of bit-size at most k, and consider the task of deciding whether V 2 {s) > A. An equivalent task is to 
decide whether V 2 {s) — A > 0. Since V 2 {s) is a real algebraic number defined by a polynomial of degree ^ and 

maximum coefficient bit-size ^ it follows that t> 2 (s) — A is a real algebraic number defined by a polynomial 

of degree ^ and maximum coefficient bit-size (k + pmP^^ \ This can be seen by subtracting A from the 

univariate representation of V 2 {s) and substituting for the root t using a resultant. By standard root separation bounds 

this means that either is V 2 {s) — A = 0 or |t; 2 (s) — P\ > V’ for some p of the form d = \ Given an 

p/2-optimal strategy tT 2 for the safety player, by fixing the strategy CT 2 we obtain an MDP for player 1, where we 
can find the value V 2 {s) of state s using linear programming, and the computed estimate ^ 2 ( 5 ) for V 2 {s) is within 
p/2 of the true value. Thus if W 2 (s) > A — p/2 we conclude that V 2 {s) > A (and similarly if W 2 (s) > A -f p/2 we 
conclude that V 2 (s) > A). Now, if we fix AT to be a constant and consider the promise problem that G has at most K 
value classes, then a rational p/2-optimal strategy 172 exists with numerators and denominators of polynomial bit-size 
by Corollary!^ Now, by simply guessing non-deterministically the strategy CT 2 and verifying as above we have the 
following result. 

Theorem 35. For a fixed constant $K$, the promise problem of deciding whether $v_1(s) > \lambda$ given a zero-sum concurrent
stochastic game with at most $K$ value classes is in coNP if player 1 has reachability objective and in NP if player 1
has safety objective.

Note that interestingly it does not follow similarly that the promise problem is in (coNP n NP), because the games 
are not symmetric. 

Remark 36 (Complexity of approximation for constant value classes). As a direct consequence we have that for a
game $G$ promised to have at most $K$ value classes, the value of a state can be approximated in $\mathrm{FP}^{\mathrm{NP}}$. This improves
on the $\mathrm{FNP}^{\mathrm{NP}}$ bound of Frederiksen and Miltersen [18] (that holds in general with no restriction on the number of
value classes).

6 Non-Zero-sum Concurrent Stochastic Games: Bounds on Patience and 
Roundedness 

In this section we consider non-zero-sum concurrent stochastic games where each player has either a reachability or a 
safety objective. We first present a remark on the lower bound in the presence of even a single player with reachability 
objective, and then for the rest of the section focus on non-zero-sum games where all players have safety objectives. 

Remark 37. In non-zero-sum concurrent stochastic games with at least two players, even if there is only one player with a
reachability objective, at least doubly-exponential patience is required for $\epsilon$-Nash equilibrium strategies. We
have the property if $k = 2$ and one player is a reachability player and the other is a safety player, from Section 3.
It is also easy to see that Lemma 19 together with Lemma 15 imply that if player 1 is identified with the objective
$(\mathrm{Reach}, \{\top\})$ and player 2 is identified with the objective $(\mathrm{Reach}, \{\bot\})$ and they are playing the Purgatory Duel, then
each strategy profile $\sigma$ that forms an $\epsilon$-Nash equilibrium, for any $0 < \epsilon < \frac13$, in the Purgatory Duel, has patience
$2^{m^{\Omega(n)}}$. This is because player 2 has a harder objective (a subset of the plays satisfies it) than in Section 3, but can
still ensure the same payoff (by using an optimal strategy for player 2 in the concurrent reachability variant, which
ensures that $\bot$ is reached with probability at least $\frac12$). In this case, we say that a strategy is optimal (resp., $\epsilon$-optimal)
for a player, if it is optimal (resp., $\epsilon$-optimal) for the corresponding player in the concurrent reachability version.
It is clear that the strategies form a Nash equilibrium (resp., $\epsilon$-Nash equilibrium) only if both strategies are
optimal (resp., $\epsilon$-optimal). Thus the doubly-exponential lower bound follows even for non-zero-sum games with two
reachability players. The key idea to extend to more players, of which at least one is a reachability player, is as follows:
Consider some reachability player $i$. The game for which the lower bound holds can be described as follows. First
player $i$ picks another player $j$ and they then proceed to play the Purgatory Duel with parameters $n, m$ against each
other. This can be captured by a game with $k(2n + 1) + 3$ states, where each matrix has size at most $\max(m, k)$.
Each player must then use doubly-exponential patience in every strategy profile that forms an $\epsilon$-Nash equilibrium, for
sufficiently small $\epsilon > 0$. First consider a player $j$ that is different from $i$, and a strategy for player $j$ with low patience.
It follows that player $i$ would then simply play against player $j$ and win with good probability. Second, consider a
strategy for player i with low patience and there are two cases. Either player i gets a payoff close to ^ or not. If 
he gets a payoff close to then the player he is most likely to play against can deviate to an optimal strategy and 
increase his payoff by an amount close to which player i loses. On the other hand, if player i gets a payoff far 
from then he can deviate to an optimal strategy and then he gets payoff 

The rest of the section is devoted to non-zero-sum concurrent stochastic games with safety objectives for all
players. We first establish an exponential upper bound on patience and then an exponential lower bound for
$\varepsilon$-Nash equilibrium strategies, for $\varepsilon > 0$.

6.1 Exponential upper bound on roundedness 

In this section we consider non-zero-sum concurrent safety games with $k \geq 2$ players; such games are also called
stay-in-a-set games by [33]. We will argue that, for all $0 < \varepsilon < \frac{1}{2}$, in any such game, there exists a strategy profile $\sigma$
that forms an $\varepsilon$-Nash equilibrium and has roundedness at most


$$-\frac{32 \cdot k^3 \cdot \ln(\varepsilon) \cdot n \cdot (\delta_{\min})^{-n} \cdot m}{\varepsilon} .$$

Note that the roundedness is only exponential, as compared to the doubly-exponential patience when there is at least
one reachability player (Remark 37). Note that the bound is polynomial in $m$ and $k$, and also polynomial in $n$ if
$\delta_{\min} = 1$.

Players already lost, and all winners. For a prefix $P_s^{\ell'}$ of a play, for a starting state $s$, play $P_s$ and length $\ell'$, let
$L(P_s^{\ell'})$ be the set of players that have not lost already in $P_s^{\ell'}$ (note that for each $i$, player $i$ has lost in a play prefix if a
state not in $S^i$ has been visited in the prefix). For a prefix $P_s^{\ell'}$ of a play, we define $W(P_s^{\ell'})$ as the event that
each player in $L(P_s^{\ell'})$ wins with probability 1.

Player-stationary strategies. As shown by [33], there exists a strategy profile $\sigma = (\sigma_i)_i$ that forms a Nash equilibrium.
They show that the strategy $\sigma_i$, for any player $i$, in the witness Nash equilibrium strategy profile has the
following properties: For each set of players $\Pi$ and state $s$, there exists a probability distribution $\sigma_i(\Pi, s)$, such that
for each prefix $P_s^{\ell'}$ of a play, for each play $P_s$ and length $\ell'$, if $P_s^{\ell'}$ ends in $s'$, we have that $\sigma_i(P_s^{\ell'}) = \sigma_i(L(P_s^{\ell'}), s')$ (i.e., the
strategy only depends on the players who have not lost yet and the current state). Also, there exists some positional
strategy $\sigma_i'$, such that $\sigma_i(\Pi, s) = \sigma_i'(s)$ for all $i \notin \Pi$ (i.e., players who have lost already play some fixed positional
strategy). This allows them to only consider the sub-game $G^{\Pi}$, which is the game in which each player $i$ not in $\Pi$ plays
$\sigma_i'$. Also, if there is a strategy profile which ensures that each player in $\Pi$ wins with probability 1 if the play starts in $s$
of $G^{\Pi}$, then the probability distribution $\sigma_i(\Pi, s)$ is pure and it ensures that the players in $\Pi$ win with probability 1.
We call strategies with these properties player-stationary strategies.

The real number $\varepsilon$ and the length $\ell$. In the remainder of this section, fix $0 < \varepsilon < \frac{1}{2}$, and fix the length $\ell$, such that

$$\ell = -n \cdot k \cdot \ln(\varepsilon/(4k)) \cdot (\delta_{\min})^{-n} .$$

We will, in Lemma 39, argue that any player-stationary strategy profile is such that with probability at least $1 - \varepsilon$ no player loses after
$\ell$ steps. Also several lemmas in this section will use $\ell$ and $\varepsilon$.

Footnote: It is not explicitly mentioned in [33] that the distributions are pure, but it follows from the fact that if all players can ensure their objectives with
probability 1, then there exists a positional strategy profile ensuring so, by just considering an MDP (with all players together) with a conjunction
of safety objectives.

The event $E(P_s^{\ell'})$. Given a play $P_s$, starting in state $s$ for some $s$, and any $\ell'$, let $E(P_s^{\ell'})$ be the event that either the
event $L(P_s^{\ell'}) \subsetneq L(P_s^{\ell'-1})$ (i.e., some player lost at the $\ell'$-th step) or the event $W(P_s^{\ell'})$ (i.e., the remaining players
win with probability 1) happens. In [33, Lemma 2.1] they show the following:

Lemma 38. Fix a player-stationary strategy profile $\sigma$. Let $T > 0$ denote a round (or a step of plays). Let $\mathcal{P}^T$ be the
set of plays, where for all plays $P_s$ in $\mathcal{P}^T$, either the remaining players win with probability 1 in round $T$ (i.e., the
event $W(P_s^T)$ happens) or some player loses in round $T$ (i.e., the event $L(P_s^T) \subsetneq L(P_s^{T-1})$ happens). For a constant
$c$ and length $\ell'$, let $p_{c,\ell'} = \Pr_\sigma[\exists T : \ell' \leq T \leq \ell' + cn \wedge P_s \in \mathcal{P}^T]$ denote the probability that the event $\mathcal{P}^T$ happens
for some $T$ between $\ell'$ and $\ell' + cn$. Then, for all constants $c$ and lengths $\ell'$, we have that

$$p_{c,\ell'} \geq 1 - (1 - (\delta_{\min})^n)^c .$$

Note that $T$ above depends on the play $P_s$. It is straightforward that players can lose at most $k$ times in any play
$P_s$, simply because there are at most $k$ players, and if the remaining players win with probability 1 in round $T$, then
they also win with probability 1 in round $T + 1$, by construction of $\sigma$.

Proof overview. Our proof will proceed as follows. Consider the game, while the players play some player-stationary
strategy profile that forms a Nash equilibrium. First, we show that it is unlikely (a low-probability event) that the players do
not play positionally (as they do once the event $W(P_s^{\ell'})$ has happened) after some exponential number of steps. Second,
we show that if we change each of the probabilities used by an exponentially small amount as compared to the Nash
equilibrium, then it is unlikely that there will be a large difference in the first exponentially many steps. This allows
us to round the probabilities to exponentially small probabilities while the players only lose little.

Lemma 39. Fix some player-stationary strategy profile $\sigma$. Consider the set $\mathcal{P}$ of plays $P_s$, under $\sigma$, such that $W(P_s^{\ell})$
does not happen. Then, the probability $\Pr_\sigma[\mathcal{P}]$ is less than $\varepsilon/4$.

Proof. Fix $0 < \varepsilon < \frac{1}{2}$ and a player-stationary strategy profile $\sigma$. Let $c = -\ln(\varepsilon/(4k)) \cdot (\delta_{\min})^{-n} > 1$. We will argue
that the event $E(P_s^T)$ happens at least $k$ times with probability at least $1 - \varepsilon/4$ over $c \cdot n \cdot k = \ell$ steps.

We consider two cases, either $\delta_{\min} = 1$ or $0 < \delta_{\min} < 1$. If $\delta_{\min} = 1$, the event $\exists 1 \leq T \leq n : E(P_s^T)$
always happens (otherwise, if it did not in some play, then a deterministic cycle satisfying the safety objectives
of all players who have not lost yet is executed, and then the players could win by playing whatever they did the last
time they were in a given state). If $0 < \delta_{\min} < 1$, we see that $c > c' = \frac{\ln(\varepsilon/(4k))}{\ln(1 - (\delta_{\min})^n)}$, since $1 + x \leq e^x$, and that
$\exists 1 \leq T \leq c' \cdot n : E(P_s^T)$ happens with probability at least $1 - \varepsilon/(4k)$ by Lemma 38. In either case, we have that
the event $\exists 1 \leq T \leq c \cdot n : E(P_s^T)$ happens with probability at least $1 - \varepsilon/(4k)$.

Next, split the plays up in epochs of length $c \cdot n$ each. The event $E(P_s^T)$ happens at least once for
$T$ ranging over the steps of an epoch with probability at least $1 - \varepsilon/(4k)$, and hence happens at least once in each of
the first $k$ epochs with probability at least $1 - \varepsilon/4$ by the union bound. At that point the remaining players win with
probability 1. The first $k$ epochs have length $c \cdot k \cdot n = \ell$ and the lemma follows. □
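As an informal sanity check of the epoch argument, the following Python sketch evaluates the bound of Lemma 38 for one epoch of $c \cdot n$ steps and then applies the union bound over $k$ epochs. The function names and the sample parameters are illustrative assumptions and are not part of the proof.

```python
import math

def epoch_failure_bound(n: int, k: int, eps: float, delta_min: float) -> float:
    """Upper bound (1 - delta_min^n)^c on the probability that the event E
    does not happen during one epoch of c*n steps, with c as in the proof."""
    c = -math.log(eps / (4 * k)) * delta_min ** (-n)
    return (1.0 - delta_min ** n) ** c

def total_failure_bound(n: int, k: int, eps: float, delta_min: float) -> float:
    """Union bound over the first k epochs (ell = c*n*k steps in total)."""
    return k * epoch_failure_bound(n, k, eps, delta_min)

# Hypothetical sample parameters (n, k, eps, delta_min), chosen only for illustration.
for n, k, eps, delta_min in [(3, 2, 0.1, 0.5), (4, 3, 0.25, 0.9), (5, 4, 0.01, 1.0)]:
    print(n, k, eps, delta_min, total_failure_bound(n, k, eps, delta_min), eps / 4)
```

In each sample row the computed bound stays below $\varepsilon/4$, which is the inequality the lemma needs.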

We use the above lemma to show that any strategy profile close to a Nash equilibrium ensures payoffs close to that
equilibrium. To do so, we use coupling (similar to HD). 

Variation distance. The variation distance is a measure of the similarity between two distributions. Given a finite set
$Z$, and two distributions $d_1$ and $d_2$ over $Z$, the variation distance of the distributions is

$$\mathrm{var}(d_1, d_2) = \frac{1}{2} \cdot \sum_{z \in Z} |d_1(z) - d_2(z)| .$$

We will extend the notion of variation distance to strategies as follows: Given two strategies $\sigma_i$ and $\sigma_i'$ for player $i$, the
variation distance between the strategies is

$$\mathrm{var}(\sigma_i, \sigma_i') = \sup_{P_s^{\ell'}} \mathrm{var}(\sigma_i(P_s^{\ell'}), \sigma_i'(P_s^{\ell'})) ;$$

i.e., it is the supremum over the variation distance of the distributions used by the strategies for finite prefixes of plays.
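The definition for a pair of distributions can be made concrete with a small Python sketch. Representing a distribution as a dictionary from outcomes to exact fractions is an assumption made only for this illustration.

```python
from fractions import Fraction

def variation_distance(d1: dict, d2: dict) -> Fraction:
    """Half of the L1 distance between two distributions given as dicts
    mapping outcomes to probabilities; missing outcomes have probability 0."""
    outcomes = set(d1) | set(d2)
    total = sum((abs(d1.get(z, Fraction(0)) - d2.get(z, Fraction(0))) for z in outcomes),
                Fraction(0))
    return total / 2

d1 = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
d2 = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
print(variation_distance(d1, d2))  # 1/4
```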

Footnote: They do not explicitly show that the constant is $1 - (\delta_{\min})^n$, but it follows easily from an inspection of the proof.

Coupling and coupling lemma. Given a pair of distributions, a coupling is a probability distribution over the joint
set of possible outcomes. Let $Z$ be a finite set. For distributions $d_1$ and $d_2$ over the finite set $Z$, a coupling $\omega$ is a
distribution over $Z \times Z$, such that for all $z \in Z$ we have $\sum_{z' \in Z} \omega(z, z') = d_1(z)$ and also for all $z' \in Z$ we have
$\sum_{z \in Z} \omega(z, z') = d_2(z')$. One of the most important properties of coupling is the coupling lemma lU, of which we
only mention and use the second part:

• (Coupling lemma). For a pair of distributions $d_1$ and $d_2$, there exists a coupling $\omega$ of $d_1$ and $d_2$, such that for a
random variable $(X, Y)$ from the distribution $\omega$, we have that $\mathrm{var}(d_1, d_2) = \Pr[X \neq Y]$.
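One standard way to obtain such a coupling is the so-called maximal coupling: put as much mass as possible on the diagonal and spread the remaining mass proportionally to the two residual distributions. The Python sketch below follows this construction; it is a textbook construction offered as an illustration and not necessarily the one used in the cited source, and the dictionary representation is an assumption.

```python
from fractions import Fraction

def maximal_coupling(d1: dict, d2: dict) -> dict:
    """Return a coupling w (dict from pairs (x, y) to probabilities) of d1 and d2
    whose off-diagonal mass equals the variation distance of d1 and d2."""
    zero = Fraction(0)
    outcomes = sorted(set(d1) | set(d2))
    common = {z: min(d1.get(z, zero), d2.get(z, zero)) for z in outcomes}
    rest1 = {z: d1.get(z, zero) - common[z] for z in outcomes}  # surplus of d1
    rest2 = {z: d2.get(z, zero) - common[z] for z in outcomes}  # surplus of d2
    var = sum(rest1.values(), zero)  # equals the variation distance
    w = {(z, z): common[z] for z in outcomes if common[z] > 0}
    if var > 0:
        for x in outcomes:
            for y in outcomes:
                mass = rest1[x] * rest2[y] / var
                if mass > 0:
                    w[(x, y)] = w.get((x, y), zero) + mass
    return w

d1 = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
d2 = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
w = maximal_coupling(d1, d2)
mismatch = sum(p for (x, y), p in w.items() if x != y)
print(mismatch)  # 1/4, the variation distance of d1 and d2
```

Since the two residual distributions have disjoint supports, all of the proportional mass lands off the diagonal, which is why the mismatch probability equals the variation distance exactly.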

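Smaller support. Fix a pair of strategies $\sigma_i$ and $\sigma_i'$ for player $i$ for some $i$. We say that $\sigma_i'$ has smaller support than
$\sigma_i$, if for all $P_s^{\ell'}$ we have that

$$\mathrm{Supp}(\sigma_i'(P_s^{\ell'})) \subseteq \mathrm{Supp}(\sigma_i(P_s^{\ell'})) .$$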

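Lemma 40. Let $\sigma = (\sigma_i)_i$ and $\sigma' = (\sigma_i')_i$ be player-stationary strategy profiles, such that

$$\mathrm{var}(\sigma_i, \sigma_i') \leq \frac{\varepsilon}{4 \cdot \ell \cdot k}$$

and such that $\sigma_i'$ has smaller support than $\sigma_i$ for all $i$. Then $\sigma'$ is such that

$$u(G, s, \sigma', i) \in [u(G, s, \sigma, i) - \varepsilon/2, \; u(G, s, \sigma, i) + \varepsilon/2]$$

for each player $i$ and state $s$.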


Proof. Fix $\sigma$ and $\sigma'$ according to the lemma statement. For any prefix $P_s^{\ell'}$ of a play, for any state $s$, length $\ell'$ and
player $i$, we have that $\mathrm{var}(\sigma_i(P_s^{\ell'}), \sigma_i'(P_s^{\ell'})) \leq \frac{\varepsilon}{4 \cdot \ell \cdot k}$, and thus we can create a coupling $\omega = (X_i^{P_s^{\ell'}}, Y_i^{P_s^{\ell'}})$ between
the two distributions $\sigma_i(P_s^{\ell'})$ and $\sigma_i'(P_s^{\ell'})$, i.e., $X_i^{P_s^{\ell'}} \sim \sigma_i(P_s^{\ell'})$ and $Y_i^{P_s^{\ell'}} \sim \sigma_i'(P_s^{\ell'})$ are such that
$\Pr[X_i^{P_s^{\ell'}} \neq Y_i^{P_s^{\ell'}}] \leq \frac{\varepsilon}{4 \cdot \ell \cdot k}$. Then, consider some state $s$ and consider a play $P_s$, picked using the random variables $X_i^{P_s^{\ell'}}$, and a play $Q_s$,
picked using the random variables $Y_i^{Q_s^{\ell'}}$ (where, if the players use the same actions in $P_s^{\ell'}$ and $Q_s^{\ell'}$, then the next state
is also the same, using an implicit coupling). Then according to Lemma 39 the probability that $W(P_s^{\ell})$ occurs is at
least $1 - \varepsilon/4$. In that case, we are interested in the probability that $Q_s = P_s$. Observe that we just need to ensure
that $P_s^{\ell}$ and $Q_s^{\ell}$ are the same, since at that point the players play according to the same positional strategy, because
of the smaller support. For each $\ell'' < \ell$, if the first $\ell''$ steps match, then the next step matches with probability at least
$1 - \frac{\varepsilon}{4 \cdot \ell \cdot k} \cdot k$, since each of the $k$ players has a probability of at most $\frac{\varepsilon}{4 \cdot \ell \cdot k}$ to differ in the two plays. Hence, all $\ell$ steps match
with probability at least $1 - \frac{\varepsilon}{4 \cdot \ell \cdot k} \cdot \ell \cdot k = 1 - \varepsilon/4$. Hence, with probability at least $1 - \varepsilon/2$ we have that $P_s$ equals $Q_s$
and thus, in particular, the payoff for each player must be the same in that case. But observe that $P_s$ is distributed like
plays under $\sigma$ and $Q_s$ is distributed like plays under $\sigma'$, and the statement follows. □


We will next show that we only need to consider deviations to player-stationary strategies for the purpose of 
player-stationary equilibria. 

Lemma 41. For all player-stationary strategy profiles $\sigma$ and each player $i$, there exists a pure player-stationary
strategy $\sigma_i'$ for player $i$ maximizing $u(G, s, \sigma[\sigma_i'], i)$.

Proof. Observe first that it does not matter what player $i$ does if he has already lost, and we can consider him to play
some fixed positional strategy in that case. Also, when the remaining players play according to $\sigma$, we can view the
game as an MDP, in the games $G^{\Pi}$. The objective of player $i$ is then to reach a sub-game $G^{\Pi}$ and a state
in that sub-game, from which he cannot lose. It is well known that such reachability objectives have positional
optimal strategies in MDPs. Hence, this strategy forms a pure player-stationary strategy in the original game. □
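To illustrate the last step of the proof, here is a minimal Python sketch of value iteration for maximal reachability in a toy MDP. The MDP, its state and action names, and the use of value iteration (rather than, e.g., linear programming) are assumptions made for illustration. The chosen action is recorded only on strict improvement, because simply picking any action that attains the final value can select a useless self-loop.

```python
def max_reach(transitions: dict, targets: set, sweeps: int = 500, tol: float = 1e-12):
    """Approximate the maximal probability of reaching `targets`.
    transitions[s][a] is a list of (probability, successor) pairs.
    Returns (values, positional choice for states where some action helps)."""
    value = {s: (1.0 if s in targets else 0.0) for s in transitions}
    choice = {}
    for _ in range(sweeps):
        new_value = dict(value)
        for s, actions in transitions.items():
            if s in targets:
                continue
            for a, successors in actions.items():
                v = sum(p * value[t] for p, t in successors)
                if v > new_value[s] + tol:  # update only on strict improvement
                    new_value[s] = v
                    choice[s] = a
        value = new_value
    return value, choice

# Hypothetical toy MDP: "safe" stands for a state from which the player cannot lose.
toy = {
    "start": {"risky": [(0.5, "safe"), (0.5, "lost")],
              "patient": [(0.1, "safe"), (0.9, "start")]},
    "safe": {"stay": [(1.0, "safe")]},
    "lost": {"stay": [(1.0, "lost")]},
}
value, choice = max_reach(toy, {"safe"})
print(round(value["start"], 6), choice["start"])
```

In the toy example the positional choice "patient" reaches the safe state almost surely, whereas "risky" only achieves probability one half.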

We will use Lemma 3 from mil. The proof only appears in lITOll . where the lemma is Lemma 4. 

Lemma 42. (Lemma 3, Hill ). Let $Z$ be a set of size $\ell$. Let $d_1$ be some distribution over $Z$ and let $q \geq \ell$ be some
integer. Then there exists some distribution $d_2$, such that for each $z \in Z$, there exists an integer $p$ such that $d_2(z) = \frac{p}{q}$
and such that $|d_1(z) - d_2(z)| < \frac{1}{q}$.
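One way to obtain such a distribution $d_2$ is largest-remainder rounding: round every probability down to a multiple of $1/q$ and hand the missing units to the outcomes with the largest fractional parts. The Python sketch below is an illustration under that assumption (the cited proof may proceed differently); the function name and the dictionary representation are hypothetical.

```python
import math
from fractions import Fraction

def round_distribution(d1: dict, q: int) -> dict:
    """Round the distribution d1 (dict of exact Fractions summing to 1) to a
    distribution whose probabilities are all multiples of 1/q."""
    units = {z: math.floor(p * q) for z, p in d1.items()}
    leftover = q - sum(units.values())  # number of units still to hand out
    # Outcomes with the largest fractional part of p*q receive one extra unit.
    by_remainder = sorted(d1, key=lambda z: d1[z] * q - units[z], reverse=True)
    for z in by_remainder[:leftover]:
        units[z] += 1
    return {z: Fraction(u, q) for z, u in units.items()}

d1 = {"x": Fraction(1, 7), "y": Fraction(6, 7)}
d2 = round_distribution(d1, 10)
print(d2)  # x -> 1/10, y -> 9/10
```

Every probability moves by strictly less than $1/q$, and an outcome with probability 0 never receives a unit, so the rounded distribution has smaller support in the sense defined above.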
We are now ready to show the main theorem of this section.

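Theorem 43. For all concurrent stochastic games in which all $k$ players have safety objectives, for all $0 < \varepsilon < \frac{1}{2}$, there exists a player-stationary
strategy profile $\sigma$ that forms an $\varepsilon$-Nash equilibrium and has roundedness at most

$$\frac{4n \cdot k^2 \cdot m \cdot \ln(4k/\varepsilon) \cdot (\delta_{\min})^{-n}}{\varepsilon} .$$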

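Proof. Fix some player-stationary strategy profile $\sigma$ that forms a Nash equilibrium and some $0 < \varepsilon < \frac{1}{2}$, and let

$$\ell := -n \cdot k \cdot \ln(\varepsilon/(4k)) \cdot (\delta_{\min})^{-n} .$$

Consider some distribution $d_1$ over some set $Z$. Observe that for each distribution $d_2$ with smaller support than
$d_1$ and such that $|d_1(z) - d_2(z)| < \frac{1}{q}$ for each $z \in \mathrm{Supp}(d_1)$, we have $\mathrm{var}(d_1, d_2) < \frac{|\mathrm{Supp}(d_1)|}{q}$. Then, applying
Lemma 42 for $q = \frac{4 \cdot \ell \cdot k \cdot m}{\varepsilon}$ and $Z = \mathrm{Supp}(d)$, to each probability distribution $d$ defining $\sigma$, we see that there exists a
player-stationary strategy profile $\sigma' = (\sigma_i')_i$, such that (1)

$$\mathrm{var}(\sigma_i, \sigma_i') < \frac{m}{q} = \frac{\varepsilon}{4 \cdot \ell \cdot k} ;$$

and (2) $\sigma_i'$ has smaller support than $\sigma_i$; and (3) each probability used by $\sigma_i'(P_s^{\ell'})$ is a fraction with denominator $q$. Observe that the strategy has
roundedness $q$.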

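We now argue that $\sigma'$ is an $\varepsilon$-Nash equilibrium. Consider some player $i$ and a player-stationary strategy $\sigma_i''$
maximizing the probability that player $i$ wins when the remaining players play according to $\sigma'$, which is known to
exist by Lemma 41. From Lemma 40 we have that

$$u(G, s, \sigma[\sigma_i''], i) \geq u(G, s, \sigma'[\sigma_i''], i) - \varepsilon/2$$

and

$$u(G, s, \sigma, i) \leq u(G, s, \sigma', i) + \varepsilon/2 .$$

Since $\sigma$ is a Nash equilibrium, we also have $u(G, s, \sigma, i) \geq u(G, s, \sigma[\sigma_i''], i)$.
Thus, $u(G, s, \sigma', i) \geq u(G, s, \sigma'[\sigma_i''], i) - \varepsilon$. This completes the proof.

□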
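The quantities in the proof are easy to evaluate numerically. The Python sketch below computes the length $\ell$ and the denominator $q = 4 \cdot \ell \cdot k \cdot m / \varepsilon$, and checks that the per-step mismatch budget $m/q$, summed over $\ell$ steps and $k$ players, is exactly $\varepsilon/4$. The function names and the sample parameters are assumptions made for illustration, and the integrality of $q$ is ignored here.

```python
import math

def horizon(n: int, k: int, eps: float, delta_min: float) -> float:
    """ell = -n * k * ln(eps / (4k)) * delta_min^(-n)."""
    return -n * k * math.log(eps / (4 * k)) * delta_min ** (-n)

def roundedness_bound(n: int, k: int, m: int, eps: float, delta_min: float) -> float:
    """q = 4 * ell * k * m / eps, i.e. 4 n k^2 m ln(4k/eps) delta_min^(-n) / eps."""
    return 4 * horizon(n, k, eps, delta_min) * k * m / eps

# Hypothetical sample parameters.
n, k, m, eps, delta_min = 3, 2, 4, 0.1, 0.5
ell = horizon(n, k, eps, delta_min)
q = roundedness_bound(n, k, m, eps, delta_min)
print(ell, q, (m / q) * ell * k)  # the last value is eps / 4
```

For $\delta_{\min} < 1$ the bound grows by a factor of roughly $1/\delta_{\min}$ per additional state, whereas for $\delta_{\min} = 1$ it grows only linearly in $n$.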


Remark 44 (Finding an $\varepsilon$-Nash equilibrium in TFNP). We explain how the results of this section imply that for non-zero-sum
concurrent stochastic games with safety objectives for all players, if the number $k$ of players is only a constant or
logarithmic, then we can compute an $\varepsilon$-Nash equilibrium in TFNP, where $\varepsilon > 0$ is given in binary as part of the input.
Note that there is a polynomial-size witness (to guess) for a stationary strategy with exponential roundedness. Observe
that a player-stationary strategy for a player is defined by $2^{k-1} + 1$ stationary strategies, one used in case that the
respective player has lost, and one for each subset of other players. Thus, we can guess polynomial-size witnesses of
$k$ player-stationary strategies with exponential roundedness, given that the number of players is at most logarithmic
in the size of the input. Hence, according to Theorem 43, we can guess a candidate strategy profile $\sigma$ that forms an
$\varepsilon$-Nash equilibrium in non-deterministic polynomial time. For each player $i$, constructing the (polynomial-sized) MDP
described in the proof of Lemma 41 and then solving it using linear programming gives us the payoff of playing the
strategy maximizing the value for player $i$ while the remaining players follow $\sigma$. If, for each player $i$, the payoff
differs by at most $\varepsilon$ from what is achieved by player $i$ when all players follow $\sigma$, then the strategy profile $\sigma$ is an $\varepsilon$-Nash
equilibrium. It follows that the approximation of some $\varepsilon$-Nash equilibrium can be achieved in TFNP, given that the
number of players is at most logarithmic.
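The counting behind the witness size can be sketched as follows. The encoding assumed here (one numerator between 0 and $q$ per state, action and stationary strategy) is a hypothetical choice made for illustration; the remark itself does not fix an encoding.

```python
def stationary_parts(k: int) -> int:
    """Stationary strategies making up one player-stationary strategy:
    one per subset of the other k - 1 players, plus one for having lost."""
    return 2 ** (k - 1) + 1

def witness_bits(n: int, m: int, k: int, q: int) -> int:
    """Bits needed to write down k player-stationary strategies of roundedness q,
    storing one numerator in {0, ..., q} per state and action (assumed encoding)."""
    bits_per_numerator = q.bit_length()
    return k * stationary_parts(k) * n * m * bits_per_numerator

print(stationary_parts(3))                  # 5
print(witness_bits(n=2, m=2, k=2, q=7))     # 72
```

The factor $2^{k-1} + 1$ is polynomial in the input size exactly when $k$ is at most logarithmic, and an exponential roundedness $q$ contributes only polynomially many bits per numerator.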


6.2 Exponential lower bound on patience 

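In this section, we show that exponential patience is required for each strategy profile that forms an $\varepsilon$-Nash
equilibrium, for any $0 < \varepsilon < \frac{1}{6}$, in a family of games $\{G_c^{\delta_{\min}} \mid c \in \mathbb{N} \wedge \delta_{\min} < 6^{-3}\}$ with two safety players.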

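Game family $G_c^{\delta_{\min}}$. For a fixed number $c \geq 1$ and $0 < \delta_{\min} < 6^{-3}$, the game $G_c^{\delta_{\min}}$ is defined as follows: There
are $n = 4 \cdot c + 3$ states, namely, $S = \{v_s, v_1, v_2, \top, \bot\} \cup \{v_j^{\ell} \mid j \in \{1, 2\} \wedge \ell \in \{1, \dots, 2 \cdot c - 1\}\}$. For player $i$ in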
state Vj, for j = 1,2, there are two actions, called of ^ and respectively. For each other state s and each player i, 
there is a single action, $a$. For simplicity, for each pair of states $s, s'$ we write $d(s, s')$ for the probability distribution

Figure 5: An illustration of the game $G_c^{\delta_{\min}}$. The probabilities are as follows: the probability of each dashed edge
is $1 - \delta_{\min}$; the probability of each dotted edge is $\delta_{\min}$; and the probability of each solid edge is 1. The only
exception is the edges from $v_s$, where the probability is written on each edge (it is $\frac{1}{2}$ in each case).

where $d(s, s')(s) = 1 - \delta_{\min}$ and $d(s, s')(s') = \delta_{\min}$. Also, we define $v_1^{2c}$ as $\top$ and $v_2^{2c}$ as $\bot$. The states $\bot$ and $\top$ are
absorbing. The state $v_s$ is such that $\delta(v_s, a, a) = U(v_1, v_2)$. For each $j \in \{1, 2\}$, the transition function of state $v_j$ is

(divs.vf^) if£ = £' 

) = < d{vs,v^‘' iff < £' 

Us iff>f' 

3 

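For each other state $v_j^{\ell}$, the transition function is $\delta(v_j^{\ell}, a, a) = d(v_s, v_j^{\ell+1})$. The objective of player 1 is $(\mathrm{Safety}, S \setminus \{\bot\})$
and the objective of player 2 is $(\mathrm{Safety}, S \setminus \{\top\})$. See Figure 5 for an illustration of $G_c^{\delta_{\min}}$.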

Near-zero-sum property. Observe that either $\bot$ or $\top$ is reached with probability 1 (and once $\top$ or $\bot$ is reached, the
game stays there). The reasoning is as follows: there is a probability of at least $(\delta_{\min})^{2c}$ to reach either $\top$ or $\bot$ within
the next $2c + 1$ steps from any state. If the current state is $v_s$, then the next state is either $v_1$ or $v_2$, and from $v_1$ or $v_2$
through $v_j^{\ell}$ for each $\ell$ from 1 to $2c - 1$, for some $j$, either $\top$ or $\bot$ is reached, and each of the steps from $v_1$ or $v_2$ onward
happens with probability at least $\delta_{\min}$, no matter the choice of the players. Hence, the game is in essence zero-sum,
since with probability 1 precisely one player wins.
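The reasoning above amounts to a geometric bound: if every block of $2c + 1$ rounds reaches a terminal state with probability at least $p$, then the probability of not having reached one after $t$ blocks is at most $(1 - p)^t$, which tends to 0. The Python sketch below evaluates this bound, taking $p = (\delta_{\min})^{2c}$ as the per-block probability; the function names and the sample values are assumptions for illustration.

```python
import math

def unabsorbed_bound(delta_min: float, c: int, rounds: int) -> float:
    """Upper bound on the probability that neither terminal state has been
    reached after `rounds` rounds, using full blocks of 2c + 1 rounds."""
    per_block = delta_min ** (2 * c)  # assumed per-block absorption probability
    blocks = rounds // (2 * c + 1)
    if blocks == 0:
        return 1.0
    if per_block >= 1.0:
        return 0.0
    return math.exp(blocks * math.log1p(-per_block))

def rounds_needed(delta_min: float, c: int, eta: float) -> int:
    """Number of rounds after which the bound above is at most eta (0 < eta < 1)."""
    per_block = delta_min ** (2 * c)
    if per_block >= 1.0:
        return 2 * c + 1
    blocks = math.ceil(math.log(eta) / math.log1p(-per_block))
    return blocks * (2 * c + 1)

# Hypothetical sample: delta_min = 0.004 (below 6^-3) and c = 1.
t = rounds_needed(0.004, 1, 1e-3)
print(t, unabsorbed_bound(0.004, 1, t))
```

The number of rounds needed is large when $\delta_{\min}$ is small, but it is always finite, which is all the near-zero-sum property requires.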

Proof overview. Our proof has two parts. We show that there is a strategy for player $i$, for each $i$, that ensures that
against all strategies for the other player, the payoff is at least $\frac{1}{2}$ for player $i$. Also, we show that for each strategy of
player $i$ with patience at most $(\delta_{\min})^{-2/3 \cdot c}$ there is a strategy for the other player such that the payoff is less than $\frac{1}{6}$
for player $i$. This then allows us to show that no strategy profile that forms a $\frac{1}{6}$-Nash equilibrium has patience less
than $(\delta_{\min})^{-2/3 \cdot c}$.

Lemma 45. For each $i$, player $i$ has a strategy $\sigma_i$ such that

$$\inf u(G, v_s, \sigma_1, \sigma_2, i) = \frac{1}{2} ,$$

where the infimum is taken over the strategies of the other player.

Footnote: Recall that $U(s, s')$ is the uniform distribution over $s$ and $s'$.

Proof. Consider the stationary strategy $\sigma_1$, where


ai{vi){al'^) = ai{v2)iaf‘^) = 


2.C 


1 + (<5min) 


2 + ((5niin) + (<5min)‘‘ 


and 


CTi(vi)(aJ’^) = CTi(v2)(ai’^) = 


1 + (<5min)^ 


2 + ((5niin) + (l^min)'^ 


Observe that fixing $\sigma_1$ as the strategy for player 1, the game turns into an MDP for player 2. Such games have a
positional strategy ensuring that the payoff for player 2 is as large as possible. Going through all four candidates for
$\sigma_2$, one can see that $\max_{\sigma_2} u(G, v_s, \sigma_1, \sigma_2, 2) = \frac{1}{2}$. Because of the near-zero-sum property, this minimizes the payoff
for player 1 (since $u(G, v_s, \sigma_1, \sigma_2, 1) + u(G, v_s, \sigma_1, \sigma_2, 2) = 1$), which is then $\inf_{\sigma_2} u(G, v_s, \sigma_1, \sigma_2, 1) = \frac{1}{2}$. The
strategy for player 2 follows from $\sigma_1$ and the symmetry of the game. □


We next argue that if player $i$ uses a low-patience strategy, then the opponent can ensure low payoff for player $i$.

Lemma 46. Let $\sigma_i$ be a strategy for player $i$ with patience at most $(\delta_{\min})^{-2/3 \cdot c}$. Then there exists a pure strategy $\sigma_j$ for the other player $j \neq i$
such that $u(G, v_s, \sigma_1, \sigma_2, j) > 1 - \frac{1}{6}$.

Proof. Consider first player 1 (the argument for player 2 follows from symmetry). Let $\sigma_1$ be some strategy with
patience at most $(\delta_{\min})^{-2/3 \cdot c}$.
The pure strategy (T 2 is defined given <Ji as follows. For plays Pf ending in state vi or V 2 we have that 


<^2 (Pi) 


a" ifai(Pi)(af) >0 
ifai(Pi)=af . 


To argue that $u(G, v_s, \sigma_1, \sigma_2, 2) > 1 - \frac{1}{6}$, we consider a play $P_{v_s}$ picked according to $(\sigma_1, \sigma_2)$, such that either $\bot$ or $\top$
is eventually reached. This is true with probability 1. Consider the last round $\ell$ in which the current state is $v_j$, for some $j = 1, 2$.
We now consider four cases: Either we have that

1. j = 1 and CTi(Pi)(a 2 ^) > 0 or 

2. j = 1 and (7i(Pi) = or 

3. j = 2 and (7i(Pi)(a^’^) > 0 or 

4. j = 2andCTi(Pi) = a^’\ 

The probability to eventually reach $\bot$ is then at least the minimum probability to eventually reach $\bot$ in each of
the four cases. In case (2) and case (4), we see that player 2 wins with probability 1. In case (1) observe that 
from a round P where (7i(Pi )(ai’^) > 0 player 1 wins (i.e., reaches T before entering Vg again) with probability 

(1 - (()min )2/3.c) . (J 

min y < ((^min)'^ and player 2 wins (i.e., reaches _L before entering Vg again) with probability 
((5min)^^^ Hence, the probability that player 1 wins if such a round is round £ is at most 


((5min)"/3- + ^ (5„,i„)2/3-c ( < 6 ’ 

where the last inequality comes from $c \geq 1$ and $\delta_{\min} < 6^{-3}$. In case (3), observe that from a round $\ell'$ in which
player 1 plays the action named in case (3) with positive probability, player 1 wins (i.e., reaches $\top$ before entering $v_s$ again) with probability at most
$(1 - (\delta_{\min})^{2/3 \cdot c}) \cdot (\delta_{\min})^{2c} \leq (\delta_{\min})^{2c}$ and player 2 wins (i.e., reaches $\bot$ before entering $v_s$ again) with probability at least
$(\delta_{\min})^{2/3 \cdot c} \cdot (\delta_{\min})^{c} = (\delta_{\min})^{5/3 \cdot c}$. Hence, the probability that player 1 wins if such a round is round $\ell$ is at most

$$\frac{(\delta_{\min})^{2c}}{(\delta_{\min})^{5/3 \cdot c} + (\delta_{\min})^{2c}} \leq \frac{(\delta_{\min})^{2c}}{(\delta_{\min})^{5/3 \cdot c}} = (\delta_{\min})^{c/3} < \frac{1}{6} ,$$

where the last inequality comes from $c \geq 1$ and $\delta_{\min} < 6^{-3}$. The desired result follows.

□

We now prove the main result that no strategy with patience only $(\delta_{\min})^{-2/3 \cdot c}$ can be a part of a $\frac{1}{6}$-Nash equilibrium.

Theorem 47. For all $c \in \mathbb{N}$ and all $0 < \delta_{\min} < 6^{-3}$, consider the game $G_c^{\delta_{\min}}$ (that has $n = 4c + 3$ states and at
most two actions for each player at all states). Each strategy profile $\sigma = (\sigma_i)_i$ that forms a $\frac{1}{6}$-Nash equilibrium has
patience at least $(\delta_{\min})^{-2/3 \cdot c}$.

Proof. Fix some $c \in \mathbb{N}$ and $0 < \delta_{\min} < 6^{-3}$. The proof will be by contradiction. Consider first player 1 (the argument
for player 2 follows from symmetry). Let $\sigma_1$ be some strategy with patience at most $(\delta_{\min})^{-2/3 \cdot c}$.
Consider some strategy $\sigma_2$ for player 2. We consider two cases, either

$$u(G, v_s, \sigma_1, \sigma_2, 2) \leq \frac{1}{2} + \frac{1}{6} = \frac{2}{3}$$

or not. If

$$u(G, v_s, \sigma_1, \sigma_2, 2) \leq \frac{2}{3} ,$$

then player 2 can play a strategy $\sigma_2'$, shown to exist in Lemma 46, instead and get payoff strictly above $1 - \frac{1}{6}$,
showing that $(\sigma_1, \sigma_2)$ is not a $\frac{1}{6}$-Nash equilibrium. On the other hand, if

$$u(G, v_s, \sigma_1, \sigma_2, 2) > \frac{2}{3} ,$$

then $u(G, v_s, \sigma_1, \sigma_2, 1) < \frac{1}{3}$ and player 1 can play a strategy $\sigma_1'$, shown to exist in Lemma 45, for which
$u(G, v_s, \sigma_1', \sigma_2, 1) \geq \frac{1}{2}$. Hence, $(\sigma_1, \sigma_2)$ does not form a $\frac{1}{6}$-Nash equilibrium in this case either. The desired
result follows. □

Remark 48. Using ideas similar to Remark 37, we can construct a game with $k \geq 3$ safety players in which
exponential patience is needed for all strategy profiles that form a $\frac{1}{6}$-Nash equilibrium.

7 Discussion and Conclusion

In this section, we discuss some important features and interesting technical aspects of our results. Finally we conclude 
with some remarks. 

7.1 Important features of results 

We now highlight two important features of our results, namely, the surprising aspects and the significance of the
results.

Surprising aspects of our results. We discuss three surprising aspects of our results.

1. The doubly-exponential lower bound on patience. For concurrent safety games, the properties of strategies resemble
those of concurrent discounted games. In both cases, (1) optimal strategies exist, (2) there exist stationary
strategies that are optimal, and (3) locally optimal strategies (that play optimally in every state with respect
to the matrix games with values) are optimal. The other class of concurrent games where optimal stationary
strategies exist is concurrent ergodic mean-payoff games; however, in contrast to safety and discounted games,
in concurrent ergodic mean-payoff games not all locally optimal strategies are optimal. However, though for
concurrent discounted games as well as for concurrent ergodic mean-payoff games the optimal bound on the patience
of $\varepsilon$-optimal stationary strategies, for $\varepsilon > 0$, is exponential, we show a doubly-exponential lower bound
on the patience of $\varepsilon$-optimal strategies for concurrent safety games, for $\varepsilon > 0$.

2. The lower bound example. The second surprising aspect of our result is the lower bound example itself, which
had been elusive for safety games. The closer the lower bound example is to known examples, the greater is
its value, as it is easier to understand, and illustrates the simplicity of our elusive example. Our example is
obtained as follows: We consider the Purgatory game $(n + 1, m)$, which has two value classes, and in this game
positional (pure memoryless) optimal strategies exist for the safety player. We simplify the game by making the
start state a deterministic state with one action for each player that with probability one goes to the next state.
We call this simplified Purgatory, and strategies in simplified Purgatory correspond to strategies in Purgatory
$(n, m)$. Then we consider the dual of the simplified Purgatory, which is basically a mirror of the simplified
Purgatory, with the roles of the players exchanged. In effect the dual is obtained by exchanging $\top$ and $\bot$. Both in
the simplified Purgatory and the dual of simplified Purgatory, there are two value classes, and positional optimal
strategies exist for the safety player. The Purgatory Duel is obtained by simply merging the start states of the
simplified Purgatory and the dual of the simplified Purgatory; thus from the start state we go to the first state of
the Purgatory $(n, m)$ and the first state of the dual of Purgatory $(n, m)$, each with probability half; see Figure 6.
Quite surprisingly we show that this simple merge operation gives a game where each state has a different value
(i.e., that has a linear number of value classes instead of two value classes), and the patience of optimal strategies
increases from 1 (positional) to doubly-exponential (even for $\varepsilon$-optimal strategies) for the safety player.

3. From reachability to safety. The third surprising aspect is that we transfer a lower bound result from concurrent
reachability to concurrent safety games. Typically, the behavior of strategies in concurrent reachability and
safety games is different, e.g., for reachability games optimal strategies do not exist in general, whereas they
exist for concurrent safety games; and even in concurrent reachability games where optimal strategies exist, not
all locally optimal strategies are optimal, whereas in concurrent safety games all locally optimal strategies are
optimal. Yet we show that a lower bound example for concurrent reachability games can be modified to obtain a
lower bound for concurrent safety games. Moreover, we show that the strategy complexity results with respect
to the number of value classes in concurrent safety games are different and much more refined as compared to
reachability games (see Table 1).

Significance of our result. There are several significant aspects of our result.

1. Roundedness and patience. As a measure of strategy complexity there are two important notions: (a) roundedness,
which is more relevant from the computational aspect; and (b) patience, which is the traditional game-theoretic
measure. The roundedness is always at least the patience, and in this work we present matching bounds
for patience and roundedness (i.e., our upper bounds are for roundedness and are matched by lower bounds
for patience). Thus our results present a complete picture of strategy complexity with respect to both well-known
measures.

2. Computational complexity. In the study of stochastic games, the most well-studied way to obtain computational
complexity results is to explicitly guess strategies and then verify the resulting game obtained after fixing the
strategy. The lower bound for concurrent reachability games by itself did not rule out that improved computational
complexity bounds can be achieved through better strategy complexity for safety games. Indeed, for a
constant number of value classes, we obtain a better complexity result due to the exponential bound on roundedness.
Our doubly-exponential lower bound shows that in general the method of explicitly guessing strategies
would require exponential space, and would not yield NP or coNP upper bounds. In other words, our results
establish that to obtain NP or coNP upper bounds for concurrent safety games in general, completely new
techniques are necessary.

3. Lower bound for algorithm. One of the most well-studied algorithms for games is the strategy-iteration algorithm
that explicitly modifies strategies. Our result shows that any natural variant of the strategy-iteration algorithm
for the safety player that explicitly computes strategies requires exponential space in the worst case.

4. Complexity of strategies. While the decision problem for games of whether the value is at least a threshold is
the most fundamental question, along with values, witness (close-to-)optimal strategies are required. Our results
present a tight bound on the complexity of strategies (which are as important as values).
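The relation between the two measures in item 1 can be checked on a single distribution. In the Python sketch below, patience is the inverse of the smallest positive probability, and roundedness is taken to be the smallest $q$ such that every probability is a multiple of $1/q$; this reading of roundedness, the function names and the exact-fraction representation are assumptions made for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def patience(dist: dict) -> Fraction:
    """Inverse of the smallest non-zero probability used by the distribution."""
    return 1 / min(p for p in dist.values() if p > 0)

def roundedness(dist: dict) -> int:
    """Smallest q such that every probability is an integer multiple of 1/q,
    i.e. the least common multiple of the reduced denominators."""
    denominators = [p.denominator for p in dist.values()]
    return reduce(lambda a, b: a * b // gcd(a, b), denominators, 1)

dist = {"a": Fraction(2, 5), "b": Fraction(3, 5)}
print(patience(dist), roundedness(dist))  # 5/2 5
```

The smallest positive probability is a positive multiple of $1/q$, hence at least $1/q$, so the roundedness is never smaller than the patience.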

In summary, our main contributions are optimal bounds on strategy complexity, and our lower bounds have significant
implications: they provide worst-case lower bounds for a natural class of algorithms, and also rule out a traditional method
to obtain computational complexity results.

7.2 Interesting technical aspects 

Remark 49 (Difference of exponential bounds). In this work we present two different exponential bounds on patience:
the first for zero-sum concurrent stochastic games, and the second for non-zero-sum concurrent stochastic games with
safety objectives for all players. However, note that the nature of the lower bounds is very different. The first lower
bound is exponential in the number of actions, and the size of the state space is constant. In contrast, for non-zero-sum
concurrent stochastic games with safety objectives for all players, if the size of the state space is constant, then our
upper bound on patience is polynomial. The second lower bound, in contrast to the first lower bound, is exponential in
the number of states (and the upper bound is polynomial in $m$ and also in the number of players).

Remark 50 (Concurrent games with deterministic transitions). We now discuss our results for concurrent games 
with deterministic transitions. It follows from the results of ^ that for zero-sum games, there is a polynomial-time 
reduction from concurrent stochastic games to concurrent games with deterministic transitions. Hence, all our lower 
bound results for zero-sum games also hold for concurrent deterministic games. Observe that this is also true for 
our lower bound on non-zero sum games with at least one reachability player, since we reduce the problem to the 
zero-sum case. However, in general for non-zero-sum games polynomial-time reductions from concurrent stochastic 
games to concurrent deterministic games are not possible. For example, for concurrent stochastic games with safety
objectives for all players we establish an exponential lower bound on the patience of strategies that constitute a $\frac{1}{6}$-Nash
equilibrium, whereas in contrast, our upper bound on patience shows that if the game is deterministic (i.e., $\delta_{\min} = 1$)
and $\varepsilon$ is constant, then there always exists an $\varepsilon$-Nash equilibrium that requires only polynomial patience.

Remark 51 (Nature of strategies for the reachability player). Another important feature of our result is as follows: 
for zero-sum concurrent stochastic games, the characterization of M9’i of e-optimal strategies as monomial strategies 
for reachability objectives, separates the description of the strategies as a part that is a function of e, and a part that is 
independent of $\varepsilon$. The previous double-exponential lower bound on patience from 12211201/ shows that the part dependent
on $\varepsilon$ requires double-exponential patience, whereas the part that is independent only requires linear patience. A
witness for e-optimal strategies in Purgatory (as described in for the value-1 problem for general zero-sum 
concurrent stochastic game) can be obtained as a ranking function on states and actions, such that the actions with 
rank 0 are played with uniform probability (linear patience); and an action of rank i at a state of rank j is played 
with probability roughly proportional to e^\ In contrast, since we show lower bound for optimal strategies (and 
the strategies are symmetric) in Purgatory Duel, our lower bound implies that also the part that is independent of e 
requires double-exponential patience in general (i.e., the probability description of e-optimal strategies needs to be 
doubly exponentially precise). 

7.3 Concluding remarks 

In this work, we established the strategy complexity of zero-sum and non-zero-sum concurrent games with safety and
reachability objectives. Our most important result is the doubly-exponential lower bound on patience for $\varepsilon$-optimal
strategies, for $\varepsilon > 0$, for the safety player in concurrent zero-sum games. Note that roundedness is at least patience,
and we present upper bounds for roundedness that match our lower bound for patience, and thus we establish tight
bounds both for roundedness and patience. Our results also imply tight bounds on “granularity” of strategies (i.e., the
minimal difference between two probabilities). Since patience is the inverse of the minimum positive probability, and some actions
can be played with probability 0, a lower bound on patience is a lower bound on granularity, and an upper bound on
roundedness is an upper bound on granularity. Finally, there are many interesting directions of future work. The first
question is the complexity of the value problem for concurrent safety games. While our results show that explicitly
guessing strategies does not yield desired complexity results, an interesting question is whether new techniques can be
developed to show that concurrent safety games can be decided in coNP in general. A second interesting question is
whether variants of the strategy-iteration algorithm can be developed that do not explicitly modify strategies and do
not have worst-case exponential-space complexity.